Abstract. Deep learning research aims at discovering learning algorithms that discover multiple levels of
distributed representations, with higher levels representing more abstract concepts. Although the study of deep
learning has already led to impressive theoretical results, learning algorithms and breakthrough experiments, sev- 
eral challenges lie ahead. This paper proposes to examine some of these challenges, centering on the questions 
of scaling deep learning algorithms to much larger models and datasets, reducing optimization difficulties due 
to ill-conditioning or local minima, designing more efficient and powerful inference and sampling procedures, 
and learning to disentangle the factors of variation underlying the observed data. It also proposes a few forward- 
looking research directions aimed at overcoming these challenges. 
1 Background on Deep Learning

Deep learning is an emerging approach within the machine
learning research community. Deep learning algorithms have
been proposed in recent years to move machine learning systems
towards the discovery of multiple levels of representation.
They have had important empirical successes in a number of
traditional AI applications such as computer vision and natural
language processing. See (Bengio, 2009; Bengio et al., 2013a)
for reviews and Bengio (2013d) and the other chapters of the
book (Montavon and Muller, 2012) for practical guidelines.
Deep learning is attracting much attention both from the
academic and industrial communities. Companies like Google,
Microsoft, Apple, IBM and Baidu are investing in deep learning,
with the first widely distributed consumer products aimed at
speech recognition. Deep learning is also used for object
recognition (Google Goggles), image and music information
retrieval (Google Image Search, Google Music), as well as
computational advertising (Corrado, 2012). A deep learning
building block (the restricted Boltzmann machine, or RBM) was
used as a crucial part of the winning entry of a million-dollar
machine learning competition (the Netflix competition)
(Salakhutdinov et al., 2007; Toscher et al., 2009). The New
York Times covered the subject twice in 2012, with front-page
articles [1]. Another series of articles (including a third New
York Times article) covered a more recent event showing off the
application of deep learning in a major Kaggle competition for
drug discovery (for example see "Deep Learning - The Biggest
Data Science Breakthrough of the Decade" [2]). Much more
recently, Google bought out ("acqui-hired") a company
(DNNresearch) created by University of Toronto professor
Geoffrey Hinton (the founder and leading researcher of deep
learning) and two of his PhD students, Ilya Sutskever and Alex
Krizhevsky, with the press writing titles such as "Google Hires
Brains that Helped Supercharge Machine Learning" (Robert
McMillan for Wired, March 13th, 2013).

[1] http://[...]m/2012/11/24/[...]-deep-learning-a-part-of-artificial-intelligence.html
[2] http://oreillynet.com/pub/e/2538

The performance of many machine learning methods 
is heavily dependent on the choice of data representa- 
tion (or features) on which they are applied. For that 
reason, much of the actual effort in deploying machine 
learning algorithms goes into the design of preprocess- 
ing pipelines that result in a hand-crafted representation 
of the data that can support effective machine learning. 
Such feature engineering is important but labor-intensive 
and highlights the weakness of many traditional learn- 
ing algorithms: their inability to extract and organize the 
discriminative information from the data. Feature engi- 
neering is a way to take advantage of human ingenu- 
ity and prior knowledge to compensate for that weak- 
ness. In order to expand the scope and ease of appli- 
cability of machine learning, it would be highly desir- 
able to make learning algorithms less dependent on fea- 
ture engineering, so that novel applications could be con- 
structed faster, and more importantly for the author, to 
make progress towards artificial intelligence (AI). 

A representation learning algorithm discovers ex- 
planatory factors or features. A deep learning algorithm 
is a particular kind of representation learning proce- 
dure that discovers multiple levels of representation, with 
higher-level features representing more abstract aspects 
of the data. This area of research was kick-started in 
2006 by a few research groups, starting with Geoff Hin- 
ton's group, who initially focused on stacking unsuper- 
vised representation learning algorithms to obtain deeper 
representations (Hinton et al., 2006; Bengio et al., 2007;
Ranzato et al., 2007; Lee et al., 2008). Since then, this
area has seen rapid growth, with an increasing number of
workshops (now one every year at the NIPS and
ICML conferences, the two major conferences in ma- 
chine learning) and even a new specialized conference 
just created in 2013 (ICLR - the International Confer- 
ence on Learning Representations). 

Transfer learning is the ability of a learning algo- 
rithm to exploit commonalities between different learn- 
ing tasks in order to share statistical strength, and trans- 
fer knowledge across tasks. Among the achievements 
of unsupervised representation learning algorithms are 
the impressive successes they obtained at the two trans- 
fer learning challenges held in 2011. First, the Transfer 
Learning Challenge, presented at an ICML 2011 workshop of the
same name, was won using unsupervised layer-wise pre-training
(Bengio, 2011; Mesnil et al., 2011). A second Transfer Learning
Challenge was held the same year and won by Goodfellow et al.
(2011) using unsupervised representation learning. Results were
presented at NIPS 2011's Challenges in Learning Hierarchical
Models Workshop.



2 Quick Overview of Deep Learning
Algorithms



The central concept behind all deep learning methodol- 
ogy is the automated discovery of abstraction, with the 
belief that more abstract representations of data such as 
images, video and audio signals tend to be more use- 
ful: they represent the semantic content of the data, di- 
vorced from the low-level features of the raw data (e.g., 
pixels, voxels, or waveforms). Deep architectures lead to 
abstract representations because more abstract concepts 
can often be constructed in terms of less abstract ones. 

Deep learning algorithms are special cases of repre- 
sentation learning with the property that they learn mul- 
tiple levels of representation. Deep learning algorithms 
often employ shallow (single-layer) representation learn- 
ing algorithms as subroutines. Before covering the unsu- 
pervised representation learning algorithms, we quickly 
review the basic principles behind supervised representa- 
tion learning algorithms such as the good old multi-layer 
neural networks. Supervised and unsupervised objectives
can of course be combined (simply added, with a hyperparameter
as coefficient), as in Larochelle and Bengio (2008)'s
discriminative RBM.



2.1 Deep Supervised Nets, Convolutional Nets, 
Dropout 

Before 2006, it was believed that training deep supervised
neural networks (Rumelhart et al., 1986) was too difficult (and
indeed did not work). The first breakthrough in training them
happened in Geoff Hinton's lab with unsupervised pre-training
by RBMs (Hinton et al., 2006), as discussed in the next
subsection. However, more recently, it was discovered that one
could train deep supervised nets by proper initialization, just
large enough for gradients to flow well and activations to
convey useful information (Glorot and Bengio, 2010; Sutskever,
2012) [3]. Another interesting ingredient in the success of
training the deep supervised networks of Glorot and Bengio
(2010) (and later of Krizhevsky et al. (2012)) is the presence
of rectifying non-linearities (such as max(0, x)) instead of
sigmoidal non-linearities (such as 1/(1 + exp(-x)) or tanh(x)).
See Jarrett et al. (2009); Nair and Hinton (2010) for earlier
work on rectifier-like non-linearities. We return to this topic
in Section 4. These good results with purely supervised
training of deep nets seem to be especially clear when large
quantities of labeled data are available, and it was
demonstrated with great success for speech recognition (Seide
et al., 2011a; Hinton et al., 2012a; Deng et al., 2013) and
object recognition (Krizhevsky et al., 2012) with breakthroughs
reducing the previous state-of-the-art error rates by 30% to
50% on difficult to beat benchmarks.

One of the key ingredients for success in the applications of
deep learning to speech, images, and natural language
processing (Bengio, 2008; Collobert et al., 2011) is the use of
convolutional architectures (LeCun et al., 1998b), which
alternate convolutional layers and pooling layers. Units on
hidden layers of a convolutional network are associated with a
spatial or temporal position
and only depend on (or generate) the values in a particu- 
lar window of the raw input. Furthermore, units on con- 
volutional layers share parameters with other units of the 
same "type" located at different positions, while at each 
location one finds all the different types of units. Units on 
pooling layers aggregate the outputs of units at a lower 
layer, either aggregating over different nearby spatial po- 
sitions (to achieve a form of local spatial invariance) or 
over different unit types. For example, a max-pooling 
unit outputs the maximum over some lower level units, 
which can therefore be seen to compete towards sending 
their signal forward. 
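
The parameter sharing and pooling described above can be sketched in a few lines. This is only a minimal 1-D illustration in pure Python, not the architecture of any cited paper; the signal, the two-tap kernel `edge_detector` and the pooling width are hypothetical values chosen for readability.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation form): the same kernel
    weights are reused at every position, i.e. units of one "type"
    share their parameters across locations."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Rectifying non-linearity max(0, x), applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, width):
    """Non-overlapping max-pooling: each pooling unit outputs the
    maximum over `width` neighbouring lower-level units."""
    return [
        max(values[i:i + width])
        for i in range(0, len(values) - width + 1, width)
    ]


signal = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0]
edge_detector = [-1.0, 1.0]  # hypothetical feature "type" (rising edge)
feature_map = relu(conv1d(signal, edge_detector))
pooled = max_pool1d(feature_map, 2)
print(feature_map)
print(pooled)
```

Each entry of `feature_map` depends only on a window of two input values, and `pooled` keeps the strongest response in each pair of neighbouring positions, which is the local spatial invariance mentioned in the text.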

[3] and potentially with the use of momentum (Sutskever, 2012)

Another key ingredient in the success of many recent
breakthrough results in the area of object recognition is the
idea of dropouts (Hinton et al., 2012b; Krizhevsky et al., 2012;
Goodfellow et al., 2013b). Interestingly, it consists in
injecting noise (randomly dropping out units with probability
1/2 from the neural network during training, and
correspondingly multiplying the magnitude of the weights by
1/2 at test time) that prevents an overly strong
co-adaptation of hidden units: hidden units
must compute a feature that will be useful even when 
half of the other hidden units are stochastically turned 
off (masked). This acts like a powerful regularizer that 
is similar to bagging aggregation but over an exponen- 
tially large number of models (corresponding to differ- 
ent masking patterns, i.e., subsets of the overall network) 
that share parameters. 
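
The train/test asymmetry of dropouts can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the cited implementations: it uses a single linear output unit (for which halving the weights reproduces the average over masks in expectation), and the activations `hidden` and weights `out_weights` are hypothetical numbers.

```python
import random


def dropout_forward(hidden, rng, p_drop=0.5):
    """Training-time pass: each hidden unit is zeroed with probability p_drop."""
    return [0.0 if rng.random() < p_drop else h for h in hidden]


def linear(weights, inputs):
    """Output of a single linear unit."""
    return sum(w * x for w, x in zip(weights, inputs))


rng = random.Random(0)
hidden = [0.2, 1.5, 0.7, 0.0, 2.1, 0.9]        # hypothetical hidden activations
out_weights = [0.5, -1.0, 0.3, 0.8, 0.1, -0.4]  # hypothetical outgoing weights

# Test time: keep every unit but multiply the outgoing weights by 1/2.
test_output = linear([0.5 * w for w in out_weights], hidden)

# Training time: average the output over many random masking patterns.
n_masks = 20000
avg_train_output = sum(
    linear(out_weights, dropout_forward(hidden, rng)) for _ in range(n_masks)
) / n_masks

print(test_output, avg_train_output)
```

The two printed values should agree up to sampling noise, which is the sense in which the halved-weight network stands in for the ensemble of masked sub-networks; with non-linear layers above the dropped units the correspondence is only approximate.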



2.2 Unsupervised or Supervised Layer-wise 
Pre-Training 

One of the key results of recent years of research in deep
learning is that deep compositions of non-linearities - such as
found in deep feedforward networks or in recurrent networks
applied over long sequences - can be very sensitive to
initialization (some initializations can lead to much better or
much worse results after training). The first type of approach
found useful to reduce that sensitivity is based on greedy
layer-wise pre-training (Hinton et al., 2006; Bengio et al.,
2007). The idea is to train one layer at a time, starting from
lower layers (on top of the input), so that there is a clear
training objective for the currently added layer (which
typically avoids the need for back-propagating error gradients
through many layers of non-linearities). With unsupervised
pre-training, each layer is trained to model the distribution
of values produced as output of the previous layer. As a
side-effect of this training, a new representation is produced,
which can be used as input for deeper layers. With the less
common supervised pre-training (Bengio et al., 2007; Yu et al.,
2010; Seide et al., 2011b), each additional layer is trained
with a supervised objective (as part of a one hidden layer
network). Again, we obtain a new representation (e.g., the
hidden or output layer of the newly trained supervised model)
that can be re-used as input for deeper layers. The effect of
unsupervised pre-training is apparently most drastic in the
context of training deep auto-encoders (Hinton and
Salakhutdinov, 2006), unsupervised learners that learn to
reconstruct their input: unsupervised pre-training makes it
possible to reach much lower training and test reconstruction
error.

2.3 Directed and Undirected Graphical Models 
with Anonymous Latent Variables 

Anonymous latent variables are latent variables that do 
not have a predefined semantics in terms of predefined 



human-interpretable concepts. Instead they are meant as 
a means for the computer to discover underlying ex- 
planatory factors present in the data. We believe that 
although non-anonymous latent variables can be very 
useful when there is sufficient prior knowledge to de- 
fine them, anonymous latent variables are very useful 
to let the machine discover complex probabilistic struc- 
ture: they lend flexibility to the model, allowing an oth- 
erwise parametric model to non-parametrically adapt to 
the amount of data when more anonymous variables are 
introduced in the model. 

Principal components analysis (PCA), independent 
components analysis (ICA), and sparse coding all cor- 
respond to a directed graphical model in which the ob- 
served vector x is generated by first independently sam- 
pling some underlying factors (put in vector h) and then 
obtaining x by Wh plus some noise. They only differ 
in the type of prior put on h, and the corresponding
inference procedures to recover h (its posterior P(h | x)
or expected value E[h | x]) when x is observed. Sparse
coding tends to yield many zeros in the estimated vector h
that could have generated the observed x. See section 3 of
Bengio et al. (2013c) for a review of representation
learning procedures based on directed or undirected
graphical models [4]. Section 2.5 describes sparse coding
in more detail.
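
The shared generative structure described above can be written compactly. The following is a sketch under stated assumptions: the noise is taken to be isotropic Gaussian with variance $\sigma^2$ (the text only says "some noise"), and $\lambda$ denotes the scale of a Laplace prior; neither symbol appears in the original text.

```latex
\begin{align}
  h &\sim P(h) = \prod_i P(h_i)
     && \text{(independent factors)} \\
  x &= W h + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)
     && \text{(linear decoder plus noise)} \\
  h^{*} &= \arg\max_h P(h \mid x)
        = \arg\min_h \; \frac{1}{2\sigma^2} \lVert x - W h \rVert^2
          - \sum_i \log P(h_i)
     && \text{(MAP inference)}
\end{align}
% With a Laplace prior P(h_i) \propto \exp(-\lambda |h_i|), the last line becomes
%   h^{*} = \arg\min_h \frac{1}{2\sigma^2} \lVert x - W h \rVert^2 + \lambda \lVert h \rVert_1 ,
% whose L1 term is what produces exact zeros in the estimated h.
```

Under these assumptions the three models differ only in $P(h_i)$: a Gaussian prior gives (probabilistic) PCA, a non-Gaussian prior gives ICA, and a sparsity-inducing prior such as the Laplace density gives sparse coding.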

An important thing to keep in mind is that directed 
graphical models tend to enjoy the property that in com- 
puting the posterior, the different factors compete with 
each other, through the celebrated explaining away ef- 
fect. Unfortunately, except in very special cases (e.g., 
when the columns of W are orthogonal, which eliminates
explaining away and its need), this results in
computationally expensive inference. Although maximum a
posteriori (MAP) inference [5] remains polynomial-time in
the case of sparse coding, this is still very expensive, and
unnecessary in other types of models (such as the stacked
auto-encoders discussed below). In fact, exact inference
becomes intractable for deeper models, as discussed in
Section 5.

Although RBMs enjoy tractable inference, this is obtained at
the cost of a lack of explaining away between the hidden units,
which could potentially limit the representational power of
E[h | x] as a good representation for the factors that could
have generated x. However, RBMs are often used as building
blocks for training deeper graphical models such as the deep
belief network (DBN) (Hinton et al., 2006) and the deep
Boltzmann machine (DBM) (Salakhutdinov and Hinton, 2009), which
can compensate for the lack of explaining away in the RBM
hidden units via a rich prior (provided by the upper layers)
which can introduce potentially complex interactions and
competition between the hidden units. Note that there is
explaining away (and intractable exact inference) in DBNs and
something analogous in DBMs.

[4] Directed and undirected: just two different views on the
semantics of probabilistic models, not mutually exclusive, but
views that are more convenient for some models than others.
[5] finding h that approximately maximizes P(h | x)



2.4 Regularized Auto-Encoders 

Auto-encoders include in their training criterion a form of
reconstruction error, such as $\|r(x) - x\|^2$, where
$r(\cdot)$ is the learned reconstruction function, often
decomposed as $r(x) = g(f(x))$ where $f(\cdot)$ is an encoding
function and $g(\cdot)$ a decoding function. The idea is that
auto-encoders should have low reconstruction error at the
training examples, but high reconstruction error in most other
configurations of the input. In the case of auto-encoders, good
generalization means that test examples (sampled from the same
distribution as training examples) also get low reconstruction
error. Auto-encoders have to be regularized to prevent them
from simply learning the identity function $r(x) = x$, which
would be useless. Regularized auto-encoders include the old
bottleneck auto-encoders (like in PCA) with fewer hidden units
than inputs, as well as the denoising auto-encoders (Vincent
et al., 2008) and contractive auto-encoders (Rifai et al.,
2011a). The denoising auto-encoder takes a noisy version
$N(x)$ of original input $x$ and tries to reconstruct $x$,
e.g., it minimizes $\|r(N(x)) - x\|^2$. The contractive
auto-encoder has a regularization penalty in addition to the
reconstruction error, trying to make hidden units $f(x)$ as
constant as possible with respect to $x$ (minimizing the
contractive penalty
$\left\|\frac{\partial f(x)}{\partial x}\right\|_F^2$). A
Taylor expansion of the denoising error shows that it is also
approximately equivalent to minimizing reconstruction error
plus a contractive penalty on $r(\cdot)$ (Alain and Bengio,
2012). As explained in Bengio et al. (2013c), the tug-of-war
between minimization of reconstruction error and the
regularizer means that the intermediate representation must
mostly capture the variations necessary to distinguish training
examples, i.e., the directions of variations on the manifold (a
lower dimensional region) near which the data generating
distribution concentrates. Score matching (Hyvarinen, 2005) is
an inductive principle that can be an interesting alternative
to maximum likelihood, and several connections have been drawn
between reconstruction error in auto-encoders and score
matching (Swersky et al., 2011). It has also been shown that
denoising auto-encoders and some forms of contractive
auto-encoders estimate the score [6] of the underlying data
generating distribution (Vincent, 2011; Alain and Bengio,
2012). This can be used to endow regularized auto-encoders with
a probabilistic interpretation and to sample from the
implicitly learned density models (Rifai et al., 2012b; Bengio
et al., 2012; Alain and Bengio, 2012) through some variant of
Langevin or Metropolis-Hastings Monte-Carlo Markov chains
(MCMC).

Even though there is a probabilistic interpretation to 
regularized auto-encoders, this interpretation does not 
involve the definition of intermediate anonymous latent 
variables. Instead, they are based on the construction of 
a direct parametrization of an encoding function which 
immediately maps an input x to its representation f(x), 
and they are motivated by geometrical considerations in
the spirit of manifold learning algorithms (Bengio et al.,
2013a). Consequently, there is no issue of tractability
of inference, even with deep auto-encoders obtained by 
stacking single-layer ones. 



It was previously believed (Ranzato et al., 2008), including by
the author himself, that reconstruction error should only be
small where the estimated density has a peak, e.g., near the
data. However, recent theoretical and empirical results (Alain
and Bengio, 2012) show that the reconstruction error will be
small where the estimated density has a peak (a mode) but also
where it has a trough (a minimum). This is because the
reconstruction error vector (reconstruction minus input)
estimates the score $\frac{\partial \log p(x)}{\partial x}$,
i.e., the reconstruction error is small where
$\left\|\frac{\partial \log p(x)}{\partial x}\right\|$ is
small. This can happen at a local maximum but also at a local
minimum (or saddle point) of the estimated density. This argues
against using reconstruction error itself as an energy
functional [7], which should only
be low near high probability points.
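
The point that the score vanishes at troughs as well as at peaks can be checked numerically. The sketch below assumes a hypothetical 1-D density, an equal mixture of two unit-variance Gaussians centred at -2 and 2; it computes the exact score of that density rather than training an auto-encoder.

```python
import math


def gaussian(x, mean, std=1.0):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def density(x):
    """Hypothetical bimodal density: 0.5 * N(-2, 1) + 0.5 * N(2, 1)."""
    return 0.5 * gaussian(x, -2.0) + 0.5 * gaussian(x, 2.0)


def score(x):
    """d log p(x) / dx, derived analytically for the mixture above."""
    d_density = (
        0.5 * (-(x + 2.0)) * gaussian(x, -2.0)
        + 0.5 * (-(x - 2.0)) * gaussian(x, 2.0)
    )
    return d_density / density(x)


for x in (-2.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  p(x)={density(x):.4f}  score={score(x):+.4f}")
```

The score is essentially zero near the two modes and exactly zero at x = 0, the low-density trough between them, while it is clearly non-zero on the slope at x = 1. A quantity proportional to the norm of the score therefore cannot by itself separate high-probability points from low-probability stationary points.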



2.5 Sparse Coding and PSD 



Sparse coding (Olshausen and Field, 1996) is a particular kind
of directed graphical model with a linear relationship between
visible and latent variables (like in PCA), but in which the
latent variables have a prior (e.g., Laplace density) that
encourages sparsity (many zeros) in the MAP posterior. Sparse
coding is not actually very good as a generative model, but has
been very successful for unsupervised feature learning (Raina
et al., 2007; Coates and Ng, 2011; Yu et al., 2011; Grosse
et al., 2007; Jenatton et al., 2009; Bach et al., 2011). See
Bengio et al. (2013d) for a brief overview in the context of
deep learning, along with connections to other unsupervised
representation learning algorithms. Like other directed
graphical models, it requires somewhat expensive inference, but
the good news is that for sparse coding, MAP inference is a
convex optimization problem for which several fast
approximations have been proposed (Mairal et al., 2009; Gregor
and LeCun, 2010a). It is interesting to note the results
obtained by Coates and Ng (2011) which suggest that sparse
coding is a better encoder but not a better learning algorithm
than RBMs and sparse auto-encoders (none of which has
explaining away). Note also that sparse coding can be
generalized into the spike-and-slab sparse coding algorithm
(Goodfellow et al., 2012), in which MAP inference is replaced
by variational inference, and that was used to win the NIPS
2011 transfer learning challenge (Goodfellow et al., 2011).

[6] derivative of the log-density with respect to the data; this
is different from the usual definition of score in statistics,
where the derivative is with respect to the parameters
[7] To define energy, we write probability as the normalized
exponential of minus the energy.

Another interesting variant on sparse coding is the predictive
sparse coding (PSD) algorithm (Kavukcuoglu et al., 2008) and
its variants, which combine properties of sparse coding and of
auto-encoders. Sparse coding can be seen as having only a
parametric "generative" decoder (which maps latent variable
values to visible variable values) and a non-parametric encoder
(find the latent variable values that minimize reconstruction
error and minus the log-prior on the latent variables). PSD
adds a parametric encoder (just an affine transformation
followed by a non-linearity) and learns it jointly with the
generative model, such that the output of the parametric
encoder is close to the latent variable values that reconstruct
the input well.
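
The "non-parametric encoder" of sparse coding, i.e. MAP inference under a Laplace prior, can be sketched with the iterative shrinkage-thresholding algorithm (ISTA). ISTA is used here only as one standard solver for this convex problem; it is not claimed to be the method of the fast approximations cited above. The dictionary `W`, the input `x`, the penalty `lam` and the step size are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrinks v towards zero, giving exact zeros."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(x, W, lam, step, n_iters=100):
    """Minimise 0.5 * ||x - W h||^2 + lam * ||h||_1 over h.

    W is a list of rows (len(x) rows, one column per latent variable).
    `step` must not exceed 1 / (largest eigenvalue of W^T W).
    """
    n_visible, n_latent = len(x), len(W[0])
    h = [0.0] * n_latent
    for _ in range(n_iters):
        # Reconstruction W h and its residual with respect to x.
        recon = [sum(W[i][j] * h[j] for j in range(n_latent)) for i in range(n_visible)]
        residual = [recon[i] - x[i] for i in range(n_visible)]
        # Gradient of the quadratic term: W^T (W h - x).
        grad = [sum(W[i][j] * residual[i] for i in range(n_visible)) for j in range(n_latent)]
        # Gradient step followed by shrinkage (the L1 / Laplace-prior term).
        h = [soft_threshold(h[j] - step * grad[j], step * lam) for j in range(n_latent)]
    return h


W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # hypothetical dictionary, orthonormal columns
x = [3.0, 0.2, 0.5]
h_map = ista(x, W, lam=0.5, step=1.0)
print(h_map)
```

With orthonormal columns there is no explaining away and the solution reduces to soft-thresholding of W^T x, so the weakly supported second factor is set exactly to zero. For a general dictionary the loop has to be iterated, which is the inference cost that PSD avoids by learning a parametric encoder.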

3 Scaling Computations 

From a computation point of view, how do we scale the 
recent successes of deep learning to much larger mod- 
els and huge datasets, such that the models are actually 
richer and capture a very large amount of information? 

3.1 Scaling Computations: The Challenge 

The beginnings of deep learning in 2006 focused on the MNIST
digit image classification problem (Hinton et al., 2006; Bengio
et al., 2007), breaking the supremacy of SVMs (1.4% error) on
this dataset [8]. The latest records are still held by deep
networks: Ciresan et al. (2012) currently claim the title of
state-of-the-art for the unconstrained version of the task
(e.g., using a convolutional architecture and stochastically
deformed data), with 0.27% error.

[8] for the knowledge-free version of the task, where no
image-specific prior is used, such as image deformations or
convolutions, where the current state-of-the-art is around 0.8%
and involves deep learning (Rifai et al., 2011b; Hinton et al.,
2012b).

In the last few years, deep learning has moved from digits to
object recognition in natural images, and the latest
breakthrough has been achieved on the ImageNet dataset [9],
bringing down the state-of-the-art error rate (out of 5
guesses) from 26.1% to 15.3% (Krizhevsky et al., 2012).

To achieve the above scaling from 28x28 grey-level MNIST images
to 256x256 RGB images, researchers have taken advantage of
convolutional architectures (meaning that hidden units do not
need to be connected to all units at the previous layer but
only to those in the same spatial area, and that pooling units
reduce the spatial resolution as we move from lower to higher
layers). They have also taken advantage of GPU technology to
speed up computation by one or two orders of magnitude (Raina
et al., 2009; Bergstra et al., 2010, 2011; Krizhevsky et al.,
2012).



We can expect computational power to continue to 
increase, mostly through increased parallelism such as 
seen in GPUs, multicore machines, and clusters. In addition,
computer memory has become much more affordable, making it
possible (at least on CPUs) to handle potentially huge models
(in terms of capacity).

However, whereas the task of recognizing handwritten 
digits is solved to the point of achieving roughly human- 
level performance, this is far from true for tasks such as 
general object recognition, scene understanding, speech 
recognition, or natural language understanding. What is 
needed to nail those tasks and scale to even more ambi- 
tious ones? 

As we approach AI-scale tasks, it should become clear that our
trained models will need to be much larger in terms of number
of parameters. This is suggested by two observations. First, AI
means understanding the world around us at roughly the same
level of competence as humans. Extrapolating from the current
state of machine learning, the amount of knowledge this
represents is bound to be large, many times more than what
current models can capture. Second, more and more empirical
results with deep learning suggest that larger models
systematically work better (Coates et al., 2011; Hinton et al.,
2012b; Krizhevsky et al., 2012; Goodfellow et al., 2013b),
provided appropriate regularization is used, such as the
dropouts technique described above.

[9] The 1000-class ImageNet benchmark, whose results are
detailed here:
http://www.image-net.org/challenges/LSVRC/2012/results.html

Part of the challenge is that the current capabilities of a
single computer are not sufficient to achieve these goals, even
if we assume that training complexity would scale linearly with
the complexity of the task. This has for example motivated the
work of the Google Brain team (Le et al., 2012; Dean et al.,
2012) to parallelize training of deep nets over a very large
number of nodes. As we will see in Section 4, we hypothesize
that as the size of the models increases, our current ways of
training deep networks become less and less efficient, so that
the computation required to train larger models (to capture
correspondingly more information) is likely to scale much worse
than linearly (Dauphin and Bengio, 2013).

Another part of the challenge is that the increase in 
computational power has been mostly coming (and will 
continue to come) from parallel computing. Unfortu- 
nately, when considering very large datasets, our most 
efficient training algorithms for deep learning (such as 
variations on stochastic gradient descent or SGD) are in- 
herently sequential (each update of the parameters re- 
quires having completed the previous update, so they 
cannot be trivially parallelized). Furthermore, for some
tasks, the amount of available data is becoming so large
that it does not fit on a disk or even on a file server, so
that it is not clear how a single CPU core could even scan
all that data (which seems necessary in order to learn from
it and exploit all of it, if training is inherently
sequential).

3.2 Scaling Computations: Solution Paths 

Parallel Updates: Asynchronous SGD. One idea that we explored
in Bengio et al. (2003) is that of asynchronous SGD: train
multiple versions of the model in parallel, each running on a
different node and seeing different subsets of the data (on
different disks), but with an asynchronous lock-free sharing
mechanism which keeps the different versions of the model not
too far from each other. If the sharing were synchronous, it
would be too inefficient because most nodes would spend their
time waiting for the sharing to be completed and would be
waiting for the slowest of the nodes. This idea has been
analyzed theoretically (Recht et al., 2011) and successfully
engineered on a grand scale recently at Google (Le et al.,
2012; Dean et al., 2012). However, current large-scale
implementations (with thousands of nodes) are still very
inefficient (in terms of use of the parallel resources), mostly
because of the communication bottleneck: parameter values must
be exchanged regularly between nodes. The above papers also
take advantage of a way to train deep networks which has been
very successful for GPU implementations, namely the use of
rather large minibatches (blocks of examples after which an
update is performed), making some parallelization (across the
examples in the minibatch) easier. One option, explored by
Coates et al. (2012), is to use as building blocks for learning
features algorithms such as k-means that can be run efficiently
over large minibatches (or the whole data) and thus
parallelized easily on a cluster (they learned 150,000 features
on a cluster with only 30 machines).

Another interesting consideration is the optimization 
of the trade-off between communication cost and computation
cost in distributed optimization algorithms, e.g., as
discussed in Tsianos et al. (2012).



Sparse Updates. One idea that we propose here is to change the
learning algorithms so as to obtain sparse updates, i.e., for
any particular minibatch there is only a small fraction of
parameters that are updated. If the amount of sparsity in the
update is large, this would mean that a much smaller fraction
of the parameters need to be exchanged between nodes when
performing an asynchronous SGD [10]. Sparse updates could be
obtained simply if the gradient is very sparse. This gradient
sparsity can arise with approaches that select paths in the
neural network. We already know methods which produce slightly
sparse updates, such as dropouts (Hinton et al., 2012b) [11],
maxout (Goodfellow et al., 2013b) [12] and other hard-pooling
mechanisms, such as the recently proposed and very successful
stochastic pooling (Zeiler and Fergus, 2013). These methods do
not provide enough sparsity, but this could be achieved in two
ways. First of all, we could choose to only pay attention to
the largest elements of the gradient vector. Second, we could
change the architecture along the lines proposed next.
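
The first option, paying attention only to the largest elements of the gradient vector, could look as follows. This is a hypothetical sketch: the text does not say what happens to the remaining gradient entries, and simply discarding them (as done here) is an assumption, as are the function name `top_k_sparse_update` and the example values.

```python
def top_k_sparse_update(params, grad, lr, k):
    """Apply an SGD step to only the k gradient entries of largest magnitude.

    Returns the indices that were touched; in a distributed setting only
    these (index, new value) pairs would need to be exchanged between nodes.
    Assumption: the other gradient entries are simply dropped.
    """
    largest = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    for i in largest:
        params[i] -= lr * grad[i]
    return sorted(largest)


params = [0.0] * 8
grad = [0.01, -2.0, 0.0, 0.3, 1.5, -0.02, 0.0, -0.4]  # hypothetical minibatch gradient
touched = top_k_sparse_update(params, grad, lr=0.1, k=2)
print(touched, params)
```

Only two of the eight parameters change, so the amount of communication per update scales with k rather than with the model size.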



10 although the gain would be reduced considerably in a
minibatch mode, roughly by the size of the minibatch

11 where half of the hidden units are turned off, although
clearly, this is not enough sparsity for reaching our
objective; unfortunately, we observed that randomly and
independently dropping a lot more than half of the units
yielded substantially worse results

12 where in addition to dropouts, only one out of k filters
wins the competition in max-pooling units, and only one
half of those survives the dropouts masking, making the
sparsity factor 2k

Conditional Computation. A central idea (that applies
whether one parallelizes or not) that we put forward is
that of conditional computation: instead of dropping out
paths independently and at random, drop them in a
learned and optimized way. Decision trees remain some
of the most appealing machine learning algorithms be- 
cause prediction time can be on the order of the loga- 
rithm of the number of parameters. Instead, in most other 
machine learning predictors, scaling is linear (i.e., much 
worse). This is because decision trees exploit conditional 
computation: for a given example, as additional compu- 
tations are performed, one can discard a gradually larger 
set of parameters (and avoid performing the associated 
computation). In deep learning, this could be achieved 
by combining truly sparse activations (values not near 
zero like in sparse auto-encoders, but actual zeros) and 
multiplicative connections whereby some hidden units 
gate other hidden units (when the gater output is zero it 
turns off the output of the gated unit). When a group A of 
hidden units has a sparse activation pattern (with many 
actual zeros) and it multiplicatively gates other hidden 
units B, then only a small fraction of the hidden units in 
B may need to be actually computed, because we know 
that these values will not be used. Such gating is simi- 
lar to what happens when a decision node of a decision 
tree selects a subtree and turns off another subtree. More 
savings can thus be achieved if units in B themselves 
gate other units, etc. The crucial difference with decision
trees (and e.g., the hard mixture of experts we introduced
a decade ago (Collobert et al., 2003)) is that the gating
units should not be mutually exclusive and should instead
form a distributed pattern. Indeed, we want to keep the
advantages of distributed representations and avoid the
limited local generalization suffered by decision trees
(Bengio et al., 2010). With a high level of conditional
computation, some parameters are used often (and are well
tuned) whereas other parameters are used very rarely,
requiring more data to estimate. A trade-off and
appropriate regularization therefore need to be
established, which will depend on the amount of training
signal going into each parameter. Interestingly,
conditional computation also helps to achieve sparse
gradients, and the fast convergence of hard mixtures of
experts (Collobert et al., 2003) provides positive evidence
that a side benefit of conditional computation will be
easier and faster optimization.

Another existing example of conditional computation and
sparse gradients is with the first layer of neural language
models, deep learning models for text data (Bengio et al.,
2003; Bengio, 2008). In that case, there is one parameter
vector per word in the vocabulary, but each sentence only
"touches" the parameters associated with the words in the
sentence. It works because the input can be seen as
extremely sparse. The question is how to perform
conditional computation in the rest of the model.

One issue with the other example we mentioned, hard
mixtures of experts (Collobert et al., 2003), is that its
training mechanism only makes sense when the gater operates
at the output layer. In that case, it is easy to get a
strong and clean training signal for the gater output: one
can just evaluate what the error would have been if a
different expert had been chosen, and train the gater to
produce a higher output for the expert that would have
produced the smallest error (or to reduce computation and
only interrogate two experts, require that the gater
correctly ranks their probability of being the best one).
The challenge is how to produce training signals for gating
units that operate in the middle of the model. One cannot
just enumerate all the gating configurations, because in a
distributed setting with many gating units, there will be
an exponential number of configurations. Interestingly,
this suggests introducing randomness in the gating process
itself, e.g., stochastically choosing one or two choices
out of the many that a group of gating units could take.
This is interesting because this is the second motivation
(after the success of dropouts as a regularizer) for
re-introducing randomness in the middle of deep networks.
This randomness would allow configurations that would
otherwise not be selected (if only a kind of "max" dictated
the gating decision) to be sometimes selected, thus
allowing a training signal to accumulate about the value of
this configuration, i.e., a training signal for the gater.
The general question of estimating or propagating gradients
through stochastic neurons is treated in another
exploratory article (Bengio, 2013a), where it is shown that
one can obtain an unbiased (but noisy) estimator of the
gradient of a loss through a discrete stochastic decision.
Another interesting idea explored in that paper is that of
adding noise just before the non-linearity (max-pooling
($\max_i x_i$) or rectifier ($\max(0, x)$)). Hence the
winner is not always the same, and when a choice wins it
has a smooth influence on the result, and that allows a
gradient signal to be provided, pushing that winner closer
to or farther from winning the competition on another
example.
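
The computational saving from multiplicative gating with exact zeros, described earlier in this section, can be illustrated with a small sketch. It is written in Python under simplifying assumptions that are not in the text: each gated unit in B has exactly one rectifier gater in A, the gated units use a tanh non-linearity, and the names `gated_layer`, `gater_weights` and `gated_weights` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_layer(x, gater_weights, gated_weights):
    """Compute units of B that are multiplicatively gated by units of A.

    Returns (outputs, number of units of B actually computed).
    """
    outputs = []
    computed = 0
    for w_a, w_b in zip(gater_weights, gated_weights):
        gate = max(0.0, dot(w_a, x))   # rectifier: an exact zero when the gater is off
        if gate == 0.0:
            outputs.append(0.0)        # gated unit is never evaluated
            continue
        computed += 1
        outputs.append(gate * math.tanh(dot(w_b, x)))
    return outputs, computed


x = [1.0, -1.0]
gater_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
gated_weights = [[1.0, 0.0], [0.3, 0.2], [0.5, 0.5], [0.1, 0.9]]
outputs, computed = gated_layer(x, gater_weights, gated_weights)
```

For this input only the first gater is active, so one of the four gated units is computed. Unlike in a decision tree, nothing prevents several gaters from being active at once, which is the distributed pattern argued for above.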

4 Optimization 

4.1 Optimization: The Challenge 

As we consider larger and larger datasets (growing faster
than the size of the models), training error and
generalization error converge. Furthermore, many pieces of
evidence in the results of experiments on deep learning
suggest that training deep networks (including recurrent
networks) involves a difficult optimization (Bengio, 2013b;
Gulcehre and Bengio, 2013; Bengio et al., 2013a). It is not
yet clear how much of the difficulty is due to local minima
and how much is due to ill-conditioning (the two main types
of optimization difficulties in continuous optimization
problems). It is therefore interesting to study the
optimization methods and difficulties involved in deep
learning, for the sake of obtaining better generalization.
Furthermore, better optimization could also have an impact
on scaling computations, discussed above.

One important thing to keep in mind, though, is that in a
deep supervised network, the top two layers (the output
layer and the top hidden layer) can rather easily be made
to overfit, simply by making the top hidden layer large
enough. However, to get good generalization, what we have
found is that one needs to optimize the lower layers, those
that are far removed from the immediate supervised training
signal (Bengio et al., 2007). These observations mean that
only looking at the training criterion is not sufficient to
assess whether a training procedure is doing a good job at
optimizing the lower layers. However, under constraints on
the top hidden layer size, training error can be a good
guide to the quality of the optimization of lower layers.
Note that supervised deep nets are very similar (in terms
of the optimization problem involved) to deep auto-encoders
and to recurrent or recursive networks, and that properly
optimizing RBMs (and more so deep Boltzmann machines) seems
more difficult: progress on training deep nets is therefore
likely to be a key to training the other types of deep
learning models.

One of the early hypotheses drawn from experiments with
layer-wise pre-training as well as from other experiments
(semi-supervised embeddings (Weston et al., 2008) and slow
feature analysis (Wiskott and Sejnowski, 2002a; Bergstra
and Bengio, 2009)) is that the training signal provided by
backpropagated gradients is sometimes too weak to properly
train intermediate layers of a deep network. This is
supported by the observation that all of these successful
techniques somehow inject a training signal into the
intermediate layers, helping them to figure out what they
should do. However, the more recent successful results with
supervised learning on very large labeled datasets suggest
that with some tweaks in the optimization procedure
(including initialization), it is sometimes possible to
achieve as good results with or without unsupervised
pre-training or semi-supervised embedding intermediate
training signals.

4.2 Optimization: Solution Paths 

In spite of these recent encouraging results, several more 
recent experimental results again point to a fundamental 
difficulty in training intermediate and lower layers. 

Diminishing Returns with Larger Networks. First, Dauphin
and Bengio (2013) show that with well-optimized SGD
training, as the size of a neural net increases, the
"return on investment" (number of training errors removed
per added hidden unit) decreases, given a fixed number of
training iterations, until the point where it goes below 1
(which is the return on investment that would be obtained
by a brain-dead memory-based learning mechanism - such as
Parzen Windows - which just copies an incorrectly labeled
example into the weights of the added hidden unit so as to
produce just the right answer for that example only). This
suggests that larger models may be fundamentally more
difficult to train, probably because there are now more
second-order interactions between the parameters,
increasing the condition number of the Hessian matrix (of
second derivatives of the training criterion with respect
to the model parameters). This notion of return on
investment may provide a useful metric by which to measure
the effect of different methods to improve the scaling
behavior of training and optimization procedures for deep
learning.
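
The return-on-investment metric is simple to compute from a sweep over model sizes. The sketch below is in Python and uses made-up numbers purely for illustration; they are not results from Dauphin and Bengio (2013), and the function name is hypothetical.

```python
def return_on_investment(errors_before, errors_after, units_before, units_after):
    """Training errors removed per added hidden unit."""
    added_units = units_after - units_before
    if added_units <= 0:
        raise ValueError("units_after must be larger than units_before")
    return (errors_before - errors_after) / added_units


# Hypothetical sweep: (number of hidden units, number of training errors),
# each run trained for the same number of iterations.
runs = [(1000, 5000), (2000, 3200), (4000, 2100), (8000, 1900)]

rois = [
    return_on_investment(e_small, e_large, u_small, u_large)
    for (u_small, e_small), (u_large, e_large) in zip(runs, runs[1:])
]
# rois is approximately [1.8, 0.55, 0.05]: the last two doublings fall below 1.
```

A value below 1 means that each added unit removes less than one training error, i.e. less than the memorizing baseline described above.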



Intermediate Concepts Guidance and Curriculum. Second,
Gulcehre and Bengio (2013) show that there are apparently
simple tasks on which standard black-box machine learning
algorithms completely fail. Even supervised and pre-trained
deep networks were tested and failed at these tasks. These
tasks have in common the characteristic that the correct
labels are obtained by the composition of at least two
levels of non-linearity and abstraction: e.g., the first
level involves the detection of objects in a scene and the
second level involves a non-linear logical operation on top
of these (such as detecting the presence of multiple
objects of the same category). On the other hand, the task
becomes easily solvable by a deep network whose
intermediate layer is first pre-trained to solve the
first-level sub-task. This raises the question of how
humans might learn even more abstract tasks, and Bengio
(2013b) studies the hypothesis that the use of language and
the evolution of culture could have helped humans reduce
that difficulty (and gain a serious advantage over other
less cultured animals). It would be interesting to explore
multi-agent learning mechanisms inspired by the
mathematical principles behind the evolution of culture in
order to bypass this optimization difficulty. The basic
idea is that humans (and current learning algorithms) are
limited to "local descent" optimization methods, that make
small changes in the parameter values with the effect of
reducing the expected loss on average. This is clearly
prone to the presence of local minima, while a more global
search (in the spirit of both genetic and cultural
evolution) could potentially reduce
this difficulty. One hypothesis is that more abstract
learning tasks involve more challenging optimization
difficulties, which would make such global optimization
algorithms necessary if we want computers to learn such
abstractions from scratch. Another option, following the
idea of curriculum learning (Bengio et al., 2009), is to
provide guidance ourselves to learning machines (as
exemplified in the toy example of Gulcehre and Bengio
(2013)), by "teaching them" gradually more complex concepts
to help them understand the world around us (keeping in
mind that we also have to do that for humans and that it
takes 20 years to complete).

Changing the learning procedure and the architecture.
Regarding the basic optimization difficulty of a single
deep network, three types of solutions should be
considered. First, there are solutions based on improved
general-purpose optimization algorithms, such as for
example the recent work on adaptive learning rates (Schaul
et al., 2012), online natural gradient (Le Roux et al.,
2008; Pascanu and Bengio, 2013) or large-minibatch second
order methods (Martens, 2010).

Another class of attacks on the optimization problem is
based on changing the architecture (family of functions and
its parametrization) or the way that the outputs are
produced (for example by adding noise). As already
introduced in LeCun et al. (1998a), changes in the
preprocessing, training objective and architecture can
change the difficulty of optimization, and in particular
improve the conditioning of the Hessian matrix (of second
derivatives of the loss with respect to parameters). With
gradient descent, training time in a quadratic bowl is
roughly proportional to the condition number of the Hessian
matrix (ratio of largest to smallest eigenvalue). For
example, LeCun et al. (1998a) recommend centering and
normalizing the inputs, an idea recently extended to hidden
layers of Boltzmann machines with success (Montavon and
Müller, 2012). A related idea that may have an impact on
ill-conditioning is the idea of skip-connections, which
forces both the mean output and the mean slope of each
hidden unit of a deep multilayer network to be zero (Raiko
et al., 2012), a centering idea which originates from
Schraudolph (1998).

There has also been very successful recent work exploiting
rectifier non-linearities for deep supervised networks
(Glorot et al., 2011a; Krizhevsky et al., 2012).
Interestingly, such non-linearities can produce rather
sparse unit outputs, which could be exploited, if the
amount of sparsity is sufficiently large, to considerably
reduce the necessary computation (because when a unit
output is 0, there is no need to actually multiply it with
its outgoing weights). Very recently, we have discovered a
variant on the rectifier non-linearity called maxout
(Goodfellow et al., 2013b) which appears to open a very
promising door towards more efficient training of deep
networks. As confirmed experimentally (Goodfellow et al.,
2013b), maxout networks can
train deeper networks and allow lower layers to undergo
more training. The more general principle at stake here may
be that when the gradient is sparse, i.e., only a small
subset of the hidden units and parameters is touched by the
gradient, the optimization problem may become easier. We
hypothesize that sparse gradient vectors have a positive
effect on reducing the ill-conditioning difficulty involved
in training deep nets. The intuition is that by making many
terms of the gradient vector 0, one also knocks off many
off-diagonal terms of the Hessian matrix, making this
matrix more diagonal-looking, which would reduce many of
the ill-conditioning effects involved, as explained below.
Indeed, gradient descent relies on an invalid assumption:
that one can modify a parameter $\theta_i$ (in the
direction of the gradient $\partial C / \partial \theta_i$
of the training criterion $C$) without taking into account
the changes in $\partial C / \partial \theta_i$ that will
take place when also modifying other parameters
$\theta_j$. Indeed, this is precisely the information that
is captured (e.g. with second-order methods) by the
off-diagonal entries
$\frac{\partial^2 C}{\partial \theta_i \partial \theta_j}
= \frac{\partial}{\partial \theta_j}
\frac{\partial C}{\partial \theta_i}$, i.e., how changing
$\theta_j$ changes the gradient on $\theta_i$. Whereas
second-order methods may have their own limitations [13],
it would be interesting if substantially reduced
ill-conditioning could be achieved by modifying the
architecture and training procedure. Sparse gradients would
be just one weapon in this line of attack.
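
The earlier statement that gradient-descent training time in a quadratic bowl grows roughly in proportion to the condition number of the Hessian can be checked numerically. The Python sketch below assumes a diagonal Hessian, a step size equal to the inverse of the largest eigenvalue, and a hypothetical stopping tolerance; none of these choices come from the text.

```python
def gd_iterations(eigenvalues, tol=1e-6, max_iter=1_000_000):
    """Gradient-descent iterations needed on f(x) = 0.5 * sum_i l_i * x_i**2.

    The Hessian is diagonal with the given positive eigenvalues l_i, so its
    condition number is max(l_i) / min(l_i).
    """
    step = 1.0 / max(eigenvalues)      # largest step that is safe in every direction
    x = [1.0] * len(eigenvalues)
    for t in range(max_iter):
        if max(abs(v) for v in x) < tol:
            return t
        # The gradient of f along coordinate i is l_i * x_i.
        x = [v - step * lam * v for v, lam in zip(x, eigenvalues)]
    return max_iter


iterations = {kappa: gd_iterations([1.0, float(kappa)]) for kappa in (10, 100, 1000)}
for kappa, n in iterations.items():
    print(f"condition number {kappa}: {n} iterations")
```

The high-curvature direction is solved almost immediately, while the low-curvature direction shrinks by a factor of (1 - 1/kappa) per step, so each tenfold increase in the condition number costs roughly ten times as many iterations.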

13 first, practical implementations never come close to
actually inverting the Hessian, and second, they often
require line searches that may be computationally
inefficient if the optimal trajectory is highly curved

As we have argued above, adding noise in an appropriate way
can be useful as a powerful regularizer (as in dropouts),
and it can also be used to make the gradient vector
sparser, which would reinforce the above positive effect on
the optimization difficulty. If some of the activations are
also sparse (as our suggestions for conditional computation
would require), then more entries of the gradient vector
will be zeroed out, also reinforcing that beneficial
optimization effect. In addition, it is plausible that the
masking noise found in dropouts (as well as in denoising
auto-encoders) encourages a faster symmetry-breaking:
quickly moving away from the condition where all hidden
units of a neural network or a Boltzmann machine do the
same thing (due to a form of symmetry in the signals they
receive), which is a non-attractive fixed point with a flat
(up to several orders) likelihood function. This means that
gradient descent can take a lot of time to pull apart
hidden units which are behaving in a very similar way.
Furthermore, when starting
from small weights, these symmetry conditions (where
many hidden units do something similar) are actually at- 
tractive from far away, because initially all the hidden 
units are trying to grab the easiest and most salient job 
(explain the gradients on the units at the layer above). 
By randomly turning off hidden units we obtain a faster 
specialization which helps training convergence. 

A related concept that has been found useful in
understanding and reducing the training difficulty of deep
or recurrent nets is the importance of letting the training
signals (back-propagated gradients) flow, in a focused way.
It is important that error signals flow so that credit and
blame are clearly assigned to different components of the
model, those that could change slightly to improve the
training loss. The problem of vanishing and exploding
gradients in recurrent nets (Hochreiter, 1991; Bengio et
al., 1994) arises because the effect of a long series of
non-linear compositions tends to produce gradients that can
either be very small (and the error signal is lost) or very
large (and the gradient steps diverge temporarily). This
idea has been exploited to propose successful
initialization procedures for deep nets (Glorot and Bengio,
2010). A composition of non-linearities is associated with
a product of Jacobian matrices, and a way to reduce the
vanishing problem would be to make sure that they have a
spectral radius (largest eigenvalue magnitude) close to 1,
like what is done in the weight initialization for Echo
State Networks (Jaeger, 2007) or in the carousel self-loop
of LSTM (Hochreiter and Schmidhuber, 1997) to help
propagation of influences over longer paths. A more generic
way to avoid gradient vanishing is to incorporate a
training penalty that encourages the propagated gradient
vectors to maintain their magnitude (Pascanu and Bengio,
2012). When combined with a gradient clipping [14]
heuristic (Mikolov, 2012) to avoid the detrimental effect
of overly large gradients, it makes it possible to train
recurrent nets on tasks on which it was not possible to
train them before (Pascanu and Bengio, 2012).
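
The gradient clipping heuristic (footnote 14: when the norm of the gradient is above a threshold r, reduce it to r) can be sketched in a few lines of Python. The function name and the example threshold are hypothetical; only the rule itself comes from the text.

```python
import math


def clip_gradient_norm(gradient, threshold):
    """Rescale the gradient so that its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)          # small enough: leave unchanged
    scale = threshold / norm
    return [g * scale for g in gradient]


large = clip_gradient_norm([3.0, 4.0], threshold=1.0)   # norm 5, rescaled to norm 1
small = clip_gradient_norm([0.3, 0.4], threshold=1.0)   # norm 0.5, left unchanged
```

Rescaling the whole vector preserves the direction of the gradient and only limits the step length.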



5 Inference and Sampling

All of the graphical models studied for deep learning
except the humble RBM require a non-trivial form of
inference, i.e., guessing values of the latent variables h
that are appropriate for the given visible input x. Several
forms of inference have been investigated in the past: MAP
inference is formulated like an optimization problem
(looking for h that approximately maximizes P(h | x)); MCMC
inference attempts to sample a sequence of h's from
P(h | x); variational inference looks for a simple
(typically factorial) approximate posterior q_x(h) that is
close to P(h | x), and usually involves an iterative
optimization procedure. See a recent machine learning
textbook for more details (Bishop, 2006; Barber, 2011;
Murphy, 2012).

In addition, a challenge related to inference is sampling
(not just from P(h | x) but also from P(h, x) or P(x)),
which like inference is often needed in the inner loop of
learning algorithms for probabilistic models with latent
variables, energy-based models (LeCun et al., 2006) or
Markov Random Fields (Kindermann, 1980) (also known as
undirected graphical models), where P(x) or P(h, x) is
defined in terms of a parametrized energy function whose
normalized exponential gives probabilities.

Deep Boltzmann machines (Salakhutdinov and Hinton, 2009)
combine the challenge of inference (for the "positive
phase" where one tries to push the energies associated with
the observed x down) and the challenge of sampling (for the
"negative phase" where one tries to push up the energies
associated with x's sampled from P(x)). Sampling for the
negative phase is usually done by MCMC, although some
learning algorithms (Collobert and Weston, 2008; Gutmann
and Hyvarinen, 2010; Bordes et al., 2013) involve "negative
examples" that are sampled through simpler procedures (like
perturbations of the observed input). In Salakhutdinov and
Hinton (2009), inference for the positive phase is achieved
with a mean-field variational approximation [15].

5.1 Inference and Sampling: The Challenge

There are several challenges involved with all of these
inference and sampling techniques.

The first challenge is practical and computational: these
are all iterative procedures that can considerably slow
down training (because inference and/or sampling is often
in the inner loop of learning).

Potential Huge Number of Modes. The second challenge is
more fundamental and has to do with the potential existence
of highly multi-modal posteriors: all of the currently
known approaches to inference and sampling are making very
strong explicit or implicit assumptions on the form of the
distribution of interest (P(h | x) or P(h, x)). As we argue
below, these approaches make sense if this target
distribution is either approximately unimodal (MAP),
(conditionally) factorizes (variational approximations,
i.e., the different factors h_i are approximately
independent [16] of each other given x), or has only a few
modes between which it is easy to mix (MCMC). However,
approximate inference can be potentially hurtful, not just
at test time but for training, because it is often in the
inner loop of the learning procedure (Kulesza and Pereira,
2008).

14 When the norm of the gradient is above a threshold r,
reduce it to r

15 In the mean-field approximation, computation proceeds
like in Gibbs sampling, but with stochastic binary values
replaced by their conditional expected value (probability
of being 1), given the outputs of the other units. This
deterministic computation is iterated like in a recurrent
network until convergence is approached, to obtain a
marginal (factorized probability) approximation over all
the units.
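
The mean-field computation described in footnote 15 can be sketched as a fixed-point iteration. The Python sketch below assumes a small, fully connected Boltzmann machine with binary units, a symmetric weight matrix with zero diagonal, and sigmoid conditionals; the weights, biases and function names are hypothetical, and a deep Boltzmann machine would restrict the connections to adjacent layers.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def mean_field(weights, biases, sweeps=200, tol=1e-10):
    """Iterate mu_i = sigmoid(b_i + sum_j W_ij * mu_j) until it stops changing.

    Each unit is updated in turn, as in Gibbs sampling, but its binary sample
    is replaced by its conditional probability of being 1.
    """
    n = len(biases)
    mu = [0.5] * n
    for _ in range(sweeps):
        largest_change = 0.0
        for i in range(n):
            total_input = biases[i] + sum(
                weights[i][j] * mu[j] for j in range(n) if j != i
            )
            new_value = sigmoid(total_input)
            largest_change = max(largest_change, abs(new_value - mu[i]))
            mu[i] = new_value
        if largest_change < tol:
            break
    return mu


W = [[0.0, 0.5, -0.3],
     [0.5, 0.0, 0.8],
     [-0.3, 0.8, 0.0]]
b = [0.1, -0.2, 0.0]
mu = mean_field(W, b)   # factorized approximation: one probability per unit
```

The result is a single vector of marginals, which is exactly why this approximation cannot represent a posterior with several distinct modes.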

Imagine for example that h represents many explana- 
tory variables of a rich audio-visual scene with a highly 
ambiguous raw input x, including the presence of sev- 
eral objects with ambiguous attributes or categories, such 
that one cannot really disambiguate one of the objects 
independently of the others (the so-called "structured 
output" scenario, but at the level of latent explanatory 
variables). Clearly, a factorized or unimodal representa- 
tion would be inadequate (because these variables are 
not at all independent, given x) while the number of 
modes could grow exponentially with the number of am- 
biguous factors present in the scene. For example, con- 
sider a visual scene x through a haze hiding most de- 
tails, yielding a lot of uncertainty. Say it involves 10 ob- 
jects (e.g., people), each having 5 ambiguous binary at- 
tributes (out of 20) (e.g., how they are dressed) and un- 
certainty between 100 categorical choices for each el- 
ement (e.g., out of 10000 persons in the database, the 
marginal evidence allows to reduce the uncertainty for 
each person to about 100 choices). Furthermore, suppose 
that these uncertainties cannot be factorized (e.g., people
tend to be in the same room with other people involved in
the same activity, and friends tend to stand physically
close to each other, and people choose to dress in a way
that is socially coherent). To make life hard on mean-field
and other factorized approximations, this means that only a
small fraction (say 1%) of these configurations are really
compatible. So one really has to consider
$1\% \times (2^5 \times 100)^{10} \approx 10^{33}$
plausible configurations of the latent variables. If one
has to take a decision y based on x, computing e.g.
$P(y \mid x) = \sum_h P(y \mid h) P(h \mid x)$ involves
summing over a huge number of non-negligible terms of the
posterior P(h | x), which we can consider as modes (the
actual dimension of h is much larger, so we have reduced
the problem from
$(2^{20} \times 10000)^{10} \approx 10^{100}$ to about
$10^{33}$, but that is still huge). One way or another,
summing explicitly over that many modes seems implausible,
and assuming a single mode (MAP) or a factorized
distribution (mean-field) would yield very poor results.
Under some assumptions on the underlying data-generating
process, it might well be possible to do inference that is
exact or a provably good approximation, and searching for
graphical models with these properties is an interesting
avenue to deal with this problem. Basically, these
assumptions work because we assume a specific structure in
the form of the underlying distribution. Also, if we are
lucky, a few Monte-Carlo samples from P(h | x) might
suffice to obtain an acceptable approximation for our y,
because somehow, as far as y is concerned, many probable
values of h yield the same answer y and a Monte-Carlo
sample will well represent these different "types" of
values of h. That is one form of regularity that could be
exploited (if it exists) to approximately solve that
problem. What if these assumptions are not appropriate to
solve challenging AI problems? Another, more general
assumption (and thus one more likely to be appropriate for
these problems) is similar to what we usually do with
machine learning: although the space of functions is
combinatorially large, we are able to generalize by
postulating a rather large and flexible family of functions
(such as a deep neural net). Thus an interesting avenue is
to assume that there exists a computationally tractable
function that can compute P(y | x) in spite of the apparent
complexity of going through the intermediate steps
involving h, and that we may learn P(y | x) through (x, y)
examples. This idea will be developed further in
Section 5.2.

16 this can be relaxed by considering tree-structured
conditional dependencies (Saul and Jordan, 1996) and
mixtures thereof
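
The orders of magnitude in the haze example can be verified with exact integer arithmetic. The short Python check below only restates the numbers given in the text (10 objects, 5 ambiguous binary attributes out of 20, 100 candidate identities out of 10000, 1% of joint configurations compatible); the variable names are hypothetical.

```python
import math

objects = 10

# Ambiguity that remains after looking at the evidence for each object.
per_object = 2 ** 5 * 100                 # 5 binary attributes, 100 identities
joint = per_object ** objects             # all joint configurations
plausible = joint // 100                  # only 1% are mutually compatible

# Size of the space before using any evidence.
full_per_object = 2 ** 20 * 10000         # 20 binary attributes, 10000 identities
full_joint = full_per_object ** objects

print(f"plausible configurations: about 10^{math.log10(plausible):.1f}")
print(f"unconstrained space:      about 10^{math.log10(full_joint):.1f}")
```

Both exponents agree with the rounded figures quoted above: slightly above 33 for the plausible configurations and slightly above 100 for the unconstrained space.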

Mixing Between Modes. What about MCMC methods? They are
hurt by the problem of mode mixing, discussed at greater
length in Bengio et al. (2013a), and summarized here. To
make the mental picture simpler, imagine that there are
only two kinds of probabilities: tiny and high. MCMC
transitions try to stay in configurations that have a high
probability (because they should occur in the chain much
more often than the tiny probability configurations). Modes
can be thought of as islands of high probability, but they
may be separated by vast seas of tiny probability
configurations. Hence, it is difficult for the Markov chain
of MCMC methods to jump from one mode of the distribution
to another, when these are separated by large low-density
regions embedded in a high-dimensional space, a common
situation in real-world data, and under the manifold
hypothesis (Cayton, 2005; Narayanan and Mitter, 2010). This
hypothesis states that natural classes present in the data
(e.g., visual object categories) are associated with
low-dimensional regions [17] (i.e., manifolds) near which
the distribution concentrates, and that different class
manifolds are well-separated by regions of very low
density. Here, what we consider a mode may be more than a
single point, it could be a whole (low-dimensional)
manifold. Slow mixing between modes means that consecutive
samples tend to be correlated (belong to the same mode) and
that it takes a very large number of consecutive sampling
steps to go from one mode to another and even more to cover
all of them, i.e., to obtain a large enough representative
set of samples (e.g. to compute an expected value under the
sampled variables distribution). This happens because these
jumps through the low-density void between modes are
unlikely and rare events. When a learner has a poor model
of the data, e.g., in the initial stages of learning, the
model tends to correspond to a smoother and higher-entropy
(closer to uniform) distribution, putting mass in larger
volumes of input space, and in particular, between the
modes (or manifolds). This can be visualized in generated
samples of images, that look more blurred and noisy. Since
MCMCs tend to make moves to nearby probable configurations,
mixing between modes is therefore initially easy for such
poor models. However, as the model improves and its
corresponding distribution sharpens near where the data
concentrate, mixing between modes becomes considerably
slower. Making one unlikely move (i.e., to a
low-probability configuration) may be possible, but making
N such moves becomes exponentially unlikely in N. Making
moves that are far and probable is fundamentally difficult
in a high-dimensional space associated with a peaky
distribution (because the exponentially large fraction of
the far moves would be to an unlikely configuration),
unless using additional (possibly learned) knowledge about
the structure of the distribution.

17 e.g. they can be charted with a few coordinates



5.2 Inference and Sampling: Solution Paths 

Going into a space where mixing is easier. The idea
of tempering (Iba, 2001) for MCMCs is analogous to the
idea of simulated annealing (Kirkpatrick et al., 1983) for
optimization, and it is designed for and looks very
appealing to solve the mode mixing problem: consider a
smooth version (higher temperature, obtained by just
dividing the energy by a temperature greater than 1) of the
distribution of interest; it therefore spreads probability
mass more uniformly so one can mix between modes
at that high temperature version of the model, and then
gradually cool to the target distribution while continuing
to make MCMC moves, to make sure we end up in
one of the "islands" of high probability. Desjardins et al.
(2010); Cho et al. (2010); Salakhutdinov (2010b,a) have
all considered various forms of tempering to address the
failure of Gibbs chain mixing in RBMs. Unfortunately, 
convincing solutions (in the sense of making a practical 
impact on training efficiency) have not yet been clearly 
demonstrated. It is not clear why this is so, but it may 
be due to the need to spend much time at some specific 
(critical) temperatures in order to succeed. More work is 
certainly warranted in that direction. 
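
As a minimal sketch of the tempering construction (the double-well energy, the grid and the temperature values are illustrative assumptions, not taken from the paper), dividing the energy by a temperature greater than 1 raises the entropy of the distribution and puts more mass on the barrier between the modes:

```python
import numpy as np

def tempered(energy, temperature):
    """p_T(s) proportional to exp(-energy(s) / temperature)."""
    logits = -energy / temperature
    logits = logits - logits.max()        # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def entropy(p):
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

x = np.linspace(-2.0, 2.0, 41)
energy = 10.0 * (x ** 2 - 1.0) ** 2       # sharp modes near x = -1 and x = +1
barrier = np.argmin(np.abs(x))            # grid point at x = 0, between the modes

for temperature in (1.0, 2.0, 8.0):
    p = tempered(energy, temperature)
    print(f"T = {temperature:4.1f}  entropy = {entropy(p):.3f}  "
          f"mass at barrier = {p[barrier]:.2e}")
```

This only shows why the high-temperature chain can cross between modes; it does not address the practical difficulty noted above, namely how long the chain must spend at intermediate temperatures while cooling.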

An interesting observation (Bengio et al., 2013b)
which could turn out to be helpful is that after we train a 
deep model such as a DBN or a stack of regularized auto- 
encoders, we can observe that mixing between modes is 
much easier at higher levels of the hierarchy (e.g. in the 
top-level RBM or top-level auto-encoder): mixing be- 
tween modes is easier at deeper levels of representation. 
This is achieved by running the MCMC in a high-level 
representation space and then projecting back in raw input
space to obtain samples at that level. The hypothesis
proposed (Bengio et al., 2013b) to explain this observation
is that unsupervised representation learning procedures
(such as for the RBM and contractive or denoising
auto-encoders) tend to discover a representation whose 
distribution has more entropy (the distribution of vectors 
in higher layers is more uniform) and that better "disen- 
tangles" or separates out the underlying factors of varia- 
tion (see next section for a longer discussion of the con- 
cept of disentangling). For example, suppose that a per- 
fect disentangling had been achieved that extracted the 
factors out of images of objects, such as object category, 
position, foreground color, etc. A single Gibbs step could 
thus switch a single top-level variable (like object cate- 
gory) when that variable is resampled given the others, 
a very local move in that top-level disentangled repre- 
sentation but a very far move (going to a very different 
place) in pixel space. Note that maximizing mutual in- 
formation between inputs and their learned determinis- 
tic representation, which is what auto-encoders basically 
do (Vincent et al., 2008), is equivalent to maximizing the
entropy of the learned representation [19], which supports
this hypothesis. An interesting idea [20] would therefore be
to use higher levels of a deep model to help the lower 
layers mix better, by using them in a way analogous to 
parallel tempering, i.e., to suggest configurations sam- 
pled from a different mode. 
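
The following toy sketch illustrates the "local move at the top, far move in pixel space" argument. The decoder is entirely hypothetical (two binary factors rendered as an 8 x 8 binary image) and stands in for the projection from a perfectly disentangled top-level representation back to input space:

```python
import numpy as np

def decode(category, polarity, size=8):
    """Hypothetical toy decoder from two binary factors to an image.

    category 0 draws horizontal stripes, category 1 vertical stripes;
    polarity 1 inverts the image.
    """
    rows, cols = np.indices((size, size))
    stripes = rows % 2 == 0 if category == 0 else cols % 2 == 0
    return np.logical_xor(stripes, bool(polarity)).astype(int)

h_before = (0, 0)
h_after = (1, 0)          # a single Gibbs-like flip of the category factor

factor_distance = sum(a != b for a, b in zip(h_before, h_after))
pixel_distance = int(np.sum(decode(*h_before) != decode(*h_after)))
print(factor_distance, "factor changed;", pixel_distance, "of 64 pixels changed")
```

A chain that resamples one pixel at a time would have to pass through many improbable intermediate images to make the same transition.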

Another interesting potential avenue for solving the
problem of sampling from a complex and rough (non-smooth)
distribution would be to take advantage of quantum
annealing effects (Rose and Macready, 2007) and
analog computing hardware (such as produced by D-Wave).
NP-hard problems (such as sampling or optimizing
exactly in an Ising model) still require exponential
time but experimental evidence has shown that for some
problems, quantum annealing is far superior to standard
digital computation (Brooke et al., 2001). Since quantum
annealing is performed by essentially implementing
a Boltzmann machine in analog hardware, it might be the
case that drawing samples from a Boltzmann machine is
one problem where quantum annealing would be dramatically
superior to classical digital computing.

18 See examples of generated images with some of the current
state-of-the-art in learned generative models of images
(Courville et al., 2011; Luo et al., 2013).
19 Salah Rifai, personal communication.
20 Guillaume Desjardins, personal communication.



Learned approximate inference and predicting a
rich posterior. If we stick to the idea of obtaining
actual values of the latent variables (either through
MAP, factorized variational inference or MCMC), then
a promising path is based on learning approximate
inference, i.e., optimizing a learned approximate inference
mechanism so that it performs a better inference
faster. This idea is not new and has been shown to
work well in many settings. It was actually
already present in the wake-sleep algorithm (Hinton et al.,
1995; Frey et al., 1996; Hinton et al., 2006) in the
context of variational inference for Sigmoidal Belief
Networks and DBNs. Learned approximate inference is
also crucial in the predictive sparse decomposition (PSD)
algorithm (Kavukcuoglu et al., 2008). This approach is
pushed further with Gregor and LeCun (2010b) in which
the parametric encoder has the same structural form
as a fast iterative sparse coding approximate inference
algorithm. The important consideration in both cases
is not just that we have fast approximate inference,
but that (a) it is learned, and (b) the model is learned
jointly with the learned approximate inference procedure.
See also Salakhutdinov and Larochelle (2010) for
learned fast approximate variational inference in DBMs,
or Bagnell and Bradley (2009); Stoyanov et al. (2011)
for learning fast approximate inference (with fewer steps
than would otherwise be required by standard general 
purpose inference) based on loopy belief propagation. 
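
To make the phrase "same structural form as a fast iterative sparse coding approximate inference algorithm" concrete, here is a minimal sketch of ISTA written as the recurrence z <- shrink(We x + S z, theta). It is an illustration under stated assumptions: the dictionary `W`, the input `x` and the penalty `lam` are hypothetical, and `We`, `S` and `theta` are set analytically, so this is plain ISTA rather than a learned encoder. In the learned variant discussed above, those three quantities would be trained and the recurrence truncated after a few steps.

```python
import numpy as np

def shrink(v, threshold):
    """Soft-thresholding, the proximal operator of the L1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def unrolled_ista(x, W, lam, n_steps):
    """Run n_steps of ISTA for 0.5 * ||x - W z||^2 + lam * ||z||_1,
    written as the recurrence z <- shrink(We x + S z, theta)."""
    L = np.linalg.eigvalsh(W.T @ W)[-1]       # Lipschitz constant of the gradient
    We = W.T / L                              # "encoder" weights
    S = np.eye(W.shape[1]) - (W.T @ W) / L    # lateral weights
    theta = lam / L                           # shrinkage threshold
    z = np.zeros(W.shape[1])
    objective = []
    for _ in range(n_steps):
        z = shrink(We @ x + S @ z, theta)
        objective.append(0.5 * np.sum((x - W @ z) ** 2) + lam * np.sum(np.abs(z)))
    return z, objective

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))             # hypothetical dictionary
x = rng.standard_normal(20)                   # hypothetical input
z, objective = unrolled_ista(x, W, lam=0.5, n_steps=30)
print(f"objective after 1 step: {objective[0]:.4f}, after 30 steps: {objective[-1]:.4f}")
```

Seen this way, a fixed number of iterations is a feed-forward computation with tied weights, which is what makes it trainable by gradient descent.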

The traditional view of probabilistic graphical mod- 
els is based on the clean separation between modeling 
(defining the model), optimization (tuning the parame- 
ters), inference (over the latent variables) and sampling 
(over all the variables, and possibly over the parameters 
as well in the Bayesian scenario). This modularization 
has clear advantages but may be suboptimal. By bringing 
learning into inference and jointly learning the approxi- 
mate inference and the "generative model" itself, one can 
hope to obtain "specialized" inference mechanisms that 
could be much more efficient and accurate than generic 
purpose ones; this was the subject of a recent ICML
workshop (Eisner, 2012). The idea of learned approximate
inference may help deal with the first (purely
computational) challenge raised above regarding inference,
i.e., it may help to speed up inference to some extent, but 
it generally keeps the approximate inference parameters 
separate from the model parameters. 

But what about the challenge from a huge number of 
modes? What if the number of modes is too large and/or 
these are too well-separated for MCMC to visit effi- 
ciently or for variational/MAP inference to approximate 
satisfactorily? If we stick to the objective of computing 
actual values of the latent variables, the logical conclu- 
sion is that we should learn to approximate a posterior 
that is represented by a rich multi-modal distribution. To 
make things concrete, imagine that we learn (or identify)
a function $f(x)$ of the visible variable $x$ that computes
the parameters $\theta = f(x)$ of an approximate posterior
distribution $Q_{\theta=f(x)}(h)$ but where
$Q_{\theta=f(x)}(h) \approx P(h \mid x)$ can be highly
multi-modal, e.g., an RBM with
visible variables $h$ (coupled with additional latent variables
used only to represent the richness of the posterior
over $h$ itself). Since the parameters of the RBM are
obtained through a parametric computation taking $x$ as
input [21], this is really a conditional RBM (Taylor et al.,
2007; Taylor and Hinton, 2009). Whereas variational
inference is usually limited to a non-parametric approximation
of the posterior, $Q(h)$ (one that is analytically and
iteratively optimized for each given x) one could con- 
sider a parametric approximate posterior that is learned 
(or derived analytically) while allowing for a rich multi- 
modal representation (such as what an RBM can capture, 
i.e., up to an exponential number of modes). 

Avoiding inference altogether by learning to perform 
the required marginalization. We now propose to con- 
sider an even more radical departure from traditional 
thinking regarding probabilistic models with latent vari- 
ables. It is motivated by the observation that even with 
the last proposal, something like a conditional RBM to 
capture the posterior P(h | x), when one has to actually 
make a decision or a prediction, it is necessary for opti- 
mal decision-making to marginalize over the latent vari- 
ables. For example, if we want to predict y given x, we 
want to compute something like $\sum_h P(y \mid h) P(h \mid x)$.
If $P(h \mid x)$ is complex and highly multi-modal (with a
huge number of modes), then even if we can represent
the posterior, performing this sum exactly is out of the
question, and even an MCMC approximation may be either
very poor (we can only visit at most $N$ modes with
$N$ MCMC steps, and that is very optimistic because of
the mode mixing issue) or very slow (requiring an exponential
number of terms being computed or a very very
long MCMC chain). It seems that we have not really
addressed the original "fundamental challenge with highly
multi-modal posteriors" raised above.

21 for many models, such as deep Boltzmann machines, or
bipartite discrete Markov random fields (Martens and
Sutskever, 2010), $f$ does not even need to be learned, it
can be derived analytically from the form of $P(h \mid x)$

To address this challenge, we propose to avoid explicit
inference altogether by avoiding sampling, enumerating,
or representing actual values of the latent variables
$h$. Instead, one can just directly learn to predict
$P(y \mid x)$, in the example of the previous paragraph.
Hence the only approximation error we are left with
is due to function approximation. This might be important
because the compounding of approximate inference
with function approximation could be very hurtful
(Kulesza and Pereira, 2008).

To get there, one may wish to mentally go through an 
intermediate step. Imagine we had a good approximate 
posterior $Q_{\theta=f(x)}(h)$ as proposed above, with parameters
$\theta = f(x)$. Then we could imagine learning an approximate
decision model that approximates and skips
the intractable sum over $h$, instead directly going from
$\theta = f(x)$ to a prediction of $y$, i.e., we would estimate
$P(y \mid x)$ by $g(f(x))$. Now since we are already learning
$f(x)$, why learn $g(\theta)$ separately? We could simply
directly learn to estimate $\pi(x) = g(f(x)) \approx P(y \mid x)$.
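
The intermediate step can be written as a chain of approximations. This is only a restatement of the argument above in one place; the first line takes the "something like" marginalization from the previous paragraph at face value, which assumes that $y$ depends on $x$ only through $h$.

```latex
\begin{align*}
P(y \mid x)
  &= \sum_h P(y \mid h)\, P(h \mid x)
     && \text{(exact, intractable sum)} \\
  &\approx \sum_h P(y \mid h)\, Q_{\theta=f(x)}(h)
     && \text{(approximate posterior, } \theta = f(x)\text{)} \\
  &\approx g(\theta) = g(f(x))
     && \text{(learned decision model that skips the sum)} \\
  &= \pi(x)
     && \text{(learn the composition directly)}
\end{align*}
```

Only the last line has to be trained; the two approximation steps above it no longer appear as separate sources of error.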

Now that may look trivial, because this is already what 
we do in discriminant training of deep networks or re- 
current networks, for example. And don't we lose all 
the advantages of probabilistic models, such as, handling 
different forms of uncertainty, missing inputs, and being 
able to answer any "question" of the form "predict any 
variables given any subset of the others"? Yes, if we stick 
to the traditional deep (or shallow) neural networks like 
those discussed in Section 2.1 [22]. But there are other
options.

We propose to get the advantages of probabilistic 
models without the need for explicitly going through 
many configurations of the latent variables. Let $x_c$ be a
subset of elements of $x$ that are clamped, $x_{-c}$ the rest,
and $x_v$ a subset of $x_{-c}$ for which we have a prediction
to make and "target" observation. We want to be able
to sample from $P(x_v \mid x_c)$. During training, for each
observed subset $s$ we want to maximize $P(x_s)$, or almost
equivalently [23] maximize $P(x_v \mid x_c)$ for any
partition $(v, c)$ of $s$. The important requirement is that the
same parameters be used to model all the predictions
$P(x_v \mid x_c)$ for any choice of $(v, c)$. For this purpose,
we could specify a computation that maps the model parameters
to a training criterion equivalent to maximizing
$\log P(x_v \mid x_c)$. The form of this computation could be
inspired by existing or novel inference mechanisms, as
has been done for learned approximate inference. However,
because the training criterion would be expressed
in terms of the observed $x$, the interpretation of the latent
variables as latent variables in $P(x, h)$ becomes superfluous.
In fact, because we start from an approximate
inference scheme, if we train the parameters with respect
to some form of input reconstruction (like generalized
pseudo-likelihood), there is no guarantee that the original
interpretation of the estimated posterior $P(h \mid x)$
continues to be meaningful. What is meaningful, though,
is the interpretation of the parameterized computational
graph that produces $P(x_v \mid x_c)$ for any $(v, c)$ pair as a
formal definition of the learned model of the data.

22 although, using something like these deep nets would be
appealing because they are currently beating benchmarks in
speech recognition, language modeling and object recognition
23 by generalized pseudo-likelihood

The approximate inference is not anymore an approx- 
imation of something else, it is the definition of the 
model itself. This is actually good news because we 
thus eliminate the issue that the approximate inference 
may be poor. The only thing we need to worry about 
is whether the parameterized computational graph that 
produces $P(x_v \mid x_c)$ is rich enough (or may overfit)
to capture the unknown data generating distribution, and 
whether it makes it easy or difficult to optimize the pa- 
rameters. With the mean-field variational inference, the 
computational graph looks like a recurrent neural net- 
work converging to a fixed point, and where we stop the 
iterations after a fixed number of steps or according to a 
convergence criterion. Such a trained parametrized computational
graph is used in the iterative variational approach
introduced in Goodfellow et al. (2013a) for training
and missing value inference in deep Boltzmann machines,
with an inpainting-like criterion in which arbitrary
subsets of pixels are predicted given the others (a
generalized pseudo-likelihood criterion). It has also been
used in a recursion that follows the template of loopy
belief propagation to fill-in the missing inputs and produce
outputs (Stoyanov et al., 2011). Although in these
cases there are latent variables (e.g. the latent variables 
of the deep Boltzmann machine) that motivate the "tem- 
plate" used for the learned approximate inference, what 
we propose here is to stop thinking about them as actual 
latent factors, but rather just as a way to parametrize this 
template for a question answering mechanism regarding
missing inputs, i.e., the "generic conditional prediction
mechanism" implemented by the recurrent computational
graph that is trained to predict any subset of variables
given any other subset. Although Goodfellow et al.
(2013a) assume a factorial distribution across the predicted
variables, we propose to investigate non-factorial
posterior distributions over the observed variables, i.e.,
in the spirit of the recent flurry of work on structured
output machine learning (Tsochantaridis et al., 2005).

We can think of this parametrized computational
graph as a family of functions, each corresponding to
answering a different question (predict a specific set of
variables given some others), but all sharing the same
parameters. We already have examples of such families
in machine learning, e.g., with recurrent neural networks
or dynamic Bayes nets (where the functions in the family
are indexed by the length of the sequence). This is
also analogous to what happens with dropout, where
we have an exponential number of neural networks
corresponding to different sub-graphs from input to output
(indexed by which hidden units are turned on or off).
For the same reason as in these examples, we obtain a
form of generalization across subsets. Following the idea
of learned approximate inference, the parameters of the
question-answering inference mechanism would be taking
advantage of the specific underlying structure in the
data generating distribution. Instead of trying to do
inference on the anonymous latent variables, it would be
trained to do good inference only over observed variables
or over high-level features learned by a deep architecture,
obtained deterministically from the observed input.

The idea that we should train with the approximate
inference as part of the computational graph for producing
a decision (and a loss) was first introduced
by Stoyanov et al. (2011), and we simply push it further
here, by proposing to allow the computational graph
to depart in any way we care to explore from the template
provided by existing inference mechanisms, i.e.,
potentially losing the connection and the reference to latent
variables with a probabilistic interpretation. Once
we free ourselves from the constraint of interpreting this
parametrized question answering computational graph as
corresponding to approximate inference involving latent
variables, all kinds of architectures and parametrizations
are possible, where current approximate inference mechanisms
can serve as inspiration and starting points. It
is quite possible that this new freedom could give rise
to much better models. The important point is that this
mechanism is trained to do well at question answering on
the provided data, and that it is really a family of functions
indexed by all the possible question/answer subsets,
but sharing their parameters.

To go farther than Goodfellow et al. (2013a) and
Stoyanov et al. (2011) it would be good to go beyond
the kind of factorized prediction common in variational
and loopy belief propagation inference. One idea is to
represent the estimated joint distribution of the predicted
variables (given the clamped variables) by a powerful
model such as an RBM or a regularized auto-encoder,
e.g., as has been done for structured output predictions
when there is complex probabilistic structure between
the output variables (Mnih et al., 2011; Li et al., 2013).
Although conditional RBMs have been already explored,
conditional distributions provided by regularized
auto-encoders remain to be studied. Alternatively,
a denoising auto-encoder (whether it is shallow or
deep) with masking noise [24] is trained to perform something
very similar to generalized pseudo-likelihood.
Note that sampling algorithms based on Langevin or
Metropolis-Hastings MCMC have already been proposed
(Rifai et al., 2012b; Alain and Bengio, 2013;
Bengio et al., 2012) for regularized auto-encoders [25] and
they could easily be adapted to conditional sampling by
clamping the fixed inputs and (optionally, to increase
representational capacity) by making the hidden unit biases
an arbitrarily complex (but deterministic) function
of the observed inputs. These theoretical analyses and 
sampling methods for regularized auto-encoders have 
been performed for the case of continuous inputs with 
squared error, and remain to be generalized to discrete 
inputs. 
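
The masking-noise criterion mentioned above can be sketched as follows. Everything in this block is an illustrative assumption rather than the method of any cited paper: a single-hidden-layer auto-encoder with tied weights, binary inputs, a factorial Bernoulli prediction, and a loss that scores only the hidden subset $x_v$ (the generalized pseudo-likelihood flavour). The point it shows is that one set of parameters serves every partition $(v, c)$, because the partition is redrawn at random on each call.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def masked_reconstruction_loss(x, params, rng, mask_prob=0.5):
    """Negative log-likelihood of a random subset x_v given the rest x_c,
    under a factorial Bernoulli prediction (an illustrative assumption)."""
    W, b_hidden, b_visible = params
    predict = rng.random(x.shape) < mask_prob      # True: element is in x_v
    x_clamped = np.where(predict, 0.0, x)          # masking noise hides x_v
    hidden = sigmoid(x_clamped @ W + b_hidden)     # same parameters for every (v, c)
    p = sigmoid(hidden @ W.T + b_visible)          # predicted P(x_i = 1 | x_c)
    log_lik = x * np.log(p) + (1 - x) * np.log(1 - p)
    return -np.sum(log_lik[predict]), predict

rng = np.random.default_rng(0)
n_visible, n_hidden = 12, 6
params = (0.1 * rng.standard_normal((n_visible, n_hidden)),
          np.zeros(n_hidden), np.zeros(n_visible))
x = (rng.random(n_visible) < 0.5).astype(float)    # hypothetical binary example
loss, predict = masked_reconstruction_loss(x, params, rng)
print(f"predicting {int(predict.sum())} of {n_visible} variables, loss = {loss:.3f}")
```

Replacing the factorial output with a richer conditional model is exactly the extension proposed in the preceding paragraph; the sketch keeps the factorial form only for brevity.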

As a refinement, and in the spirit of a long tradition of
discriminatively oriented machine learning, when some
of the observed variables y are of particular interest (because
we often want to predict them), one would naturally
present examples of the prediction of y given x
more often to the learning algorithm than random subsets
of observed variables. Hybrids of generative and discriminant
training criteria have been very successful for
RBMs (Larochelle and Bengio, 2008; Larochelle et al.,
2012) and would make practical sense here as well.

All these ideas lead to the question: what is the inter- 
pretation of hidden layers, if not directly of the under- 
lying generative latent factors? The answer may simply 
be that they provide a better representation of these fac- 
tors, a subject discussed in the next section. But what 
about the representation of uncertainty about these fac- 
tors? The author believes that humans and other animals 
carry in their head an internal representation that implic- 
itly captures both the most likely interpretation of any 
of these factors (in case a hard decision about some of 
them has to be taken) and uncertainty about their joint 
assignment. This is of course a speculation. Somehow, 
our brain would be operating on implicit representations 
of the joint distribution between these explanatory factors,
generally without having to commit until a decision
is required or somehow provoked by our attention mechanisms
(which seem related to our tendency to verbalize
a discrete interpretation). A good example is foreign language
understanding for a person who does not master
that foreign language. Until we consciously think about
it, we generally don't commit to a particular meaning for
an ambiguous word (which would be required by MAP inference),
or even to the segmentation of the speech in
words, but we can take a hard decision that depends on
the interpretation of these words if we have to, without
having to go through this intermediate step of discrete
interpretation, instead treating the ambiguous information
as soft cues that may inform our decision. In
that example, a factorized posterior is also inadequate
because some word interpretations are more compatible
with each other.

24 in which some of the inputs are set to 0 and the auto-encoder
is trying to predict them, as well as the rest, in its reconstruction
25 These methods iterate between encoding, decoding, and injecting
noise, with the possibility of rejecting poor configurations

To summarize, what we propose here, unlike in previ- 
ous work on approximate inference, is to drop the pre- 
tense that the learned approximate inference mechanism 
actually approximates the latent variables distribution, 
mode, or expected value. Instead, we only consider the 
approximate inference over observed variables (or of val- 
ues of features computed from the observed variables at a 
higher level of a deep architecture) and we consider that 
this mechanism is itself the model, rather than some ap- 
proximation, and we train it with a training criterion that 
is consistent with that interpretation. By removing the in- 
terpretation of approximately marginalizing over latent 
variables, we free ourselves from a strong constraint and 
open the door to any parametrized computation which 
has the requirement that its parameters can be shared 
across any question/answer subset. 

This discussion is of course orthogonal to the use of 
Bayesian averaging methods in order to produce better- 
generalizing predictions, i.e., handling uncertainty due 
to a small number of training examples. The proposed 
methods can be made Bayesian just like neural networks
have their Bayesian variants (Neal, 1994), by somehow
maintaining an implicit or explicit distribution over
parameters. A promising step in this direction was proposed
by Welling and Teh (2011), making such Bayesian
computation tractable by exploiting the randomness in- 
troduced with stochastic gradient descent to also produce 
the Bayesian samples over the uncertain parameter val- 
ues. 
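
A minimal sketch of the idea of reusing stochastic-gradient noise to draw approximate posterior samples, in the style of stochastic gradient Langevin dynamics. The setting is an assumption chosen so that the exact answer is known: unit-variance Gaussian observations with unknown mean `theta` and a zero-mean Gaussian prior. The constant step size is a simplification of this sketch; a decreasing schedule would normally be used.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)   # hypothetical observations
N, batch_size, prior_var = len(data), 10, 100.0

# Exact posterior mean for a N(0, prior_var) prior and unit-variance likelihood.
exact_mean = data.sum() / (N + 1.0 / prior_var)

theta, step, samples = 0.0, 1e-3, []
for t in range(6000):
    batch = rng.choice(data, size=batch_size, replace=False)
    grad_log_prior = -theta / prior_var
    grad_log_lik = (N / batch_size) * np.sum(batch - theta)   # rescaled minibatch gradient
    noise = rng.normal(scale=np.sqrt(step))                   # injected Gaussian noise
    theta += 0.5 * step * (grad_log_prior + grad_log_lik) + noise
    if t >= 1000:                                             # discard burn-in
        samples.append(theta)

sgld_mean = float(np.mean(samples))
print(f"exact posterior mean {exact_mean:.3f}, sampled estimate {sgld_mean:.3f}")
```

Each update costs the same as one stochastic gradient step, which is why this route to Bayesian averaging is compatible with the large-scale training discussed elsewhere in the paper.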

6 Disentangling 

6.1 Disentangling: The Challenge 

What are "underlying factors" explaining the data? The 
answer is not obvious. One answer could be that these
are factors that can be separately controlled (one could
set up a way to change one but not the others). This can
actually be observed by looking at sequential real-world
data, where only a small proportion of the factors typically
change from t to t + 1. Complex data arise from
the rich interaction of many sources. These factors interact
in a complex web that can complicate AI-related
tasks such as object classification. If we could identify
and separate out these factors (i.e., disentangle them), we
would have almost solved the learning problem. For ex- 
ample, an image is composed of the interaction between 
one or more light sources, the object shapes and the ma- 
terial properties of the various surfaces present in the im- 
age. It is important to distinguish between the related but 
distinct goals of learning invariant features and learning 
to disentangle explanatory factors. The central difference 
is the preservation of information. Invariant features, by 
definition, have reduced sensitivity in the directions of 
invariance. This is the goal of building features that are 
insensitive to variation in the data that are uninformative 
to the task at hand. Unfortunately, it is often difficult to 
determine a priori which set of features and variations 
will ultimately be relevant to the task at hand. Further, as 
is often the case in the context of deep learning methods, 
the feature set being trained may be destined to be used in 
multiple tasks that may have distinct subsets of relevant 
features. Considerations such as these lead us to the con- 
clusion that the most robust approach to feature learning 
is to disentangle as many factors as possible, discarding 
as little information about the data as is practical. 

Deep learning algorithms that can do a much bet- 
ter job of disentangling the underlying factors of vari- 
ation would have tremendous impact. For example, sup- 
pose that the underlying factors can be "guessed" (pre- 
dicted) from a simple (e.g. linear) transformation of the 
learned representation, ideally a transformation that only 
depends on a few elements of the representation. That is 
what we mean by a representation that disentangles the 
underlying factors. It would clearly make learning a new 
supervised task (which may be related to one or a few 
of them) much easier, because the supervised learning 
could quickly learn those linear factors, zooming in on
the parts of the representation that are relevant. 
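
The "few elements of the representation" criterion can be illustrated with a toy comparison. The factors, the mixing matrix and the helper `elements_used` are hypothetical. Both representations below preserve all the information (the mixing is invertible) and both allow a linear read-out, but only the disentangled one lets the read-out of a factor rely on a single element:

```python
import numpy as np

rng = np.random.default_rng(0)
factors = rng.standard_normal((200, 3))        # hypothetical underlying factors

mixing = np.array([[1.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0],
                   [1.0, 0.0, 1.0]])           # invertible, so no information is lost
disentangled = factors                          # one element per factor
entangled = factors @ mixing                    # every element blends two factors

def elements_used(representation, target, tol=1e-8):
    """Number of representation elements a linear predictor of target relies on."""
    weights, *_ = np.linalg.lstsq(representation, target, rcond=None)
    return int(np.sum(np.abs(weights) > tol))

target = factors[:, 0]
print(elements_used(disentangled, target), elements_used(entangled, target))
```

A linear mixing is the mildest possible form of entanglement; in real data the factors interact non-linearly, so the entangled case is much harder than this sketch suggests.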

Of all the challenges discussed in this paper, this is 
probably the most ambitious, and success in solving it 
the most likely to have far-reaching impact. In addition 
to the obvious observation that disentangling the under- 
lying factors is almost like pre-solving any possible task 
relevant to the observed data, having disentangled repre- 
sentations would also solve other issues, such as the issue 
of mixing between modes. We believe that it would also 
considerably reduce the optimization problems involved 
when new information arrives and has to be reconciled 
with the world model implicit in the current parameter
setting. Indeed, it would allow only changing the parts 
of the model that involve the factors that are relevant to 
the new observation, in the spirit of sparse updates and 
reduced ill-conditioning discussed above. 

6.2 Disentangling: Solution Paths 

Deeper Representations Disentangle Better. There are 
some encouraging signs that our current unsupervised 
representation-learning algorithms are reducing the
"entanglement" of the underlying factors [26] when we apply
them to raw data (or to the output of a previous represen- 
tation learning procedure, like when we stack RBMs or 
regularized auto-encoders). 

First, there are experimental observations suggest- 
ing that sparse convolutional RBMs and sparse de- 
noising auto-encoders achieve in their hidden units 
a greater degree of disentangling than in their inputs
(Goodfellow et al., 2009; Glorot et al., 2011b).
What these authors found is that some hidden units were 
particularly sensitive to a known factor of variation while 
being rather insensitive (i.e., invariant) to others. For 
example, in a sentiment analysis model that sees unla- 
beled paragraphs of customer comments from the Ama- 
zon web site, some hidden units specialized on the topic 
of the paragraph (the type of product being evaluated, 
e.g., book, video, music) while other units specialized 
on the sentiment (positive vs negative). The disentan- 
glement was never perfect, so the authors made quan- 
titative measurements of sensitivity and invariance and 
compared these quantities on the input and the output 
(learned representation) of the unsupervised learners. 

Another encouraging observation (already mentioned 
in the section on mixing) is that deeper representa- 
tions were empirically found to be more amenable to 
quickly mixing between modes (Bengio et al., 2013b).
Two (compatible) hypotheses were proposed to explain
this observation: (1) RBMs and regularized auto-encoders
deterministically transform [27] their input distribution
into one that is more uniform-looking, that better
fills the space (thus creating easier paths between
modes), and (2) these algorithms tend to discover representations
that are more disentangled. The advantage of
a higher-level disentangled representation is that a small
MCMC step (e.g. Gibbs) in that space (e.g. flipping one
high-level variable) can move in one step from one input-level
mode to a distant one, e.g., going from one shape
/ object to another one, adding or removing glasses on
the face of a person (which requires a very sharp coordination
of pixels far from each other because glasses occupy
a very thin image area), or replacing foreground and
background colors (such as going into a "reverse video"
mode).

26 as measured by how predictive some individual features are
of known factors
27 when considering the features learned, e.g., the $P(h_i = 1 \mid x)$,
for RBMs

Although these observations are encouraging, we do 
not yet have a clear understanding as to why some rep- 
resentation algorithms tend to move towards more disen- 
tangled representations, and there are other experimental 
observations suggesting that this is far from sufficient.
In particular, Gulcehre and Bengio (2013) show an example
of a task on which deep supervised nets (and every
other black-box machine learning algorithm tried)
fail, on which a completely disentangled input representation
makes the task feasible (with a maxout network
(Goodfellow et al., 2013b)). Unfortunately, unsupervised
pre-training applied on the raw input images
failed to produce enough disentangling to solve the task, 
even with the appropriate convolutional structure. What 
is interesting is that we now have a simple artificial task 
on which we can evaluate new unsupervised represen- 
tation learning methods for their disentangling ability. 
It may be that a variant of the current algorithms will 
eventually succeed at this task, or it may be that alto- 
gether different unsupervised representation learning al- 
gorithms are needed. 



Generic Priors for Disentangling Factors of Variation.
A general strategy was outlined in Bengio et al.
(2013a) to enhance the discovery of representations
which disentangle the underlying and unknown factors
of variation: it relies on exploiting priors about these factors.
We are most interested in broad generic priors that
can be useful for a large class of learning problems of
interest in AI. We list these priors here:

• Smoothness: assumes the function $f$ to be learned is
s.t. $x \approx y$ generally implies $f(x) \approx f(y)$. This most
basic prior is present in most machine learning, but is
insufficient to get around the curse of dimensionality.

• Multiple explanatory factors: the data generating dis- 
tribution is generated by different underlying factors, 
and for the most part what one learns about one fac- 
tor generalizes in many configurations of the other fac- 
tors. The objective is to recover or at least disentangle 
these underlying factors of variation. This assumption is 
behind the idea of distributed representations. More 
specific priors on the form of the model can be used 
to enhance disentangling, such as multiplicative interactions
between the factors (Tenenbaum and Freeman,
2000; Desjardins et al., 2012) or orthogonality of the
features derivative with respect to the input (Rifai et al.,
2011b, 2012a; Sohn et al., 2013). The parametrization
and training procedure may also be used to disentangle
discrete factors (e.g., detecting a shape) from associated
continuous-valued factors (e.g., pose parameters),
as in transforming auto-encoders (Hinton et al.,
2011), spike-and-slab RBMs with pooled slab variables
(Courville et al., 2011) and other pooling-based
models that learn a feature subspace (Kohonen, 1996;
Hyvärinen and Hoyer, 2000).

• A hierarchical organization of explanatory factors:
the concepts that are useful for describing the world
around us can be defined in terms of other concepts, in 
a hierarchy, with more abstract concepts higher in the 
hierarchy, defined in terms of less abstract ones. This 
assumption is exploited with deep representations. Al- 
though stacking single-layer models has been rather suc- 
cessful, much remains to be done regarding the joint 
training of all the layers of a deep unsupervised model. 

• Semi-supervised learning: with inputs X and target
Y to predict given X, a subset of the factors explaining
X's distribution explains much of Y, given X. Hence
representations that are useful for spelling out P(X) tend
to be useful when learning P(Y | X), allowing sharing
of statistical strength between the unsupervised and
supervised learning tasks. However, many of the factors
that explain X may dominate those that also explain Y,
which can make it useful to incorporate observations of
Y in training the learned representations, i.e., by
semi-supervised representation learning.

• Shared factors across tasks: with many Y's of interest 
or many learning tasks in general, tasks (e.g., the corre- 
sponding P(Y | X, task)) are explained by factors that 
are shared with other tasks, allowing sharing of statisti- 
cal strength across tasks, e.g. for multi-task and transfer 
learning or domain adaptation. This can be achieved by 
sharing embeddings or representation functions across
tasks (Collobert and Weston, 2008; Bordes et al., 2013b).

• Manifolds: probability mass concentrates near regions 
that have a much smaller dimensionality than the original 
space where the data lives. This is exploited with regu- 
larized auto-encoder algorithms, but training criteria that 
would explicitly take into account that we are looking for 
a concentration of mass in an integral number of directions
remain to be developed. 

• Natural clustering: different values of categorical 
variables such as object classes are associated with sep- 
arate manifolds. More precisely, the local variations on 
the manifold tend to preserve the value of a category, 
and a linear interpolation between examples of different 
classes in general involves going through a low density 
region, i.e., P(X | Y = i) for different i tend to be well
separated and not overlap much. For example, this is
exploited in the Manifold Tangent Classifier (Rifai et al.,
2011b). This hypothesis is consistent with the idea that
humans have named categories and classes because of 
such statistical structure (discovered by their brain and 
propagated by their culture), and machine learning tasks 
often involve predicting such categorical variables.

• Temporal and spatial coherence: this prior, introduced
in Becker and Hinton (1992), is similar to the natural
clustering assumption but concerns sequences of
observations: consecutive (from a sequence) or spatially
nearby observations tend to be easily predictable from
each other. In the special case typically studied, e.g.,
slow feature analysis (Wiskott and Sejnowski, 2002),
one assumes that consecutive values are close to each 
other, or that categorical concepts remain either present 
or absent for most of the transitions. More generally, 
different underlying factors change at different tempo- 
ral and spatial scales, and this could be exploited to sift 
different factors into different categories based on their 
temporal scale. 

• Sparsity: for any given observation x, only a 
small fraction of the possible factors are relevant. 
In terms of representation, this could be represented 
by features that are often zero (as initially proposed
by Olshausen and Field (1996)), or more generally by
the fact that most of the extracted features are insensi- 
tive to small variations of x. This can be achieved with 
certain forms of priors on latent variables (peaked at 0), 
or by using a non-linearity whose value is often flat at
0 (i.e., 0 and with a 0 derivative), or simply by penalizing
the magnitude of the derivatives of the function mapping 
input to representation. A variant on that hypothesis is 
that for any given input, only a small part of the model is 
relevant and only a small subset of the parameters need 
to be updated. 

• Simplicity of Factor Dependencies: in good high- 
level representations, the factors are related to each other 
through simple, typically linear, dependencies. This can 
be seen in many laws of physics, and is assumed when 
plugging a linear predictor on top of a learned represen- 
tation. 
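
As a minimal sketch of how two of the priors listed above (sparsity and temporal coherence) might be turned into training penalties, consider the following Python fragment. The function names, the L1 form of the sparsity penalty, the squared-difference form of the slowness penalty and all numerical values are assumptions chosen for illustration; they are not taken from the works cited above.

```python
# Illustrative sketch only: names, penalty forms and data are assumptions.


def relu(v):
    """Non-linearity that is flat at 0: value and derivative are 0 for inputs <= 0."""
    return [max(0.0, x) for x in v]


def sparsity_penalty(h):
    """L1 penalty: sum of absolute feature values, small when most features are 0."""
    return sum(abs(x) for x in h)


def slowness_penalty(sequence):
    """Temporal coherence: mean squared change between consecutive representations."""
    total = 0.0
    for prev, curr in zip(sequence, sequence[1:]):
        total += sum((a - b) ** 2 for a, b in zip(prev, curr))
    return total / (len(sequence) - 1)


# Hypothetical pre-activations of four features for one observation x.
pre_activation = [-1.5, 0.25, -0.5, 2.0]
h = relu(pre_activation)
fraction_zero = h.count(0.0) / len(h)

# Two hypothetical sequences of 2-dimensional representations over three steps.
slow = [[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]]   # changes gradually
fast = [[1.0, 0.0], [-1.0, 2.0], [1.0, 0.0]]  # changes abruptly

print("sparsity penalty:", sparsity_penalty(h), "fraction of zeros:", fraction_zero)
print("slowness penalty (slow):", slowness_penalty(slow))
print("slowness penalty (fast):", slowness_penalty(fast))
```

In such a sketch, either penalty would be added, with some weight, to the main training criterion: the first favors representations in which few features are active for a given x, the second favors features that vary slowly across consecutive observations.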

7 Conclusion 

Deep learning and more generally representation
learning are recent areas of investigation in machine learning,
and recent years of research have made it possible to clearly
identify several major challenges for bringing the
performance of these algorithms closer to that of humans. We
have broken down these challenges into four major ar- 
eas: scaling computations, reducing the difficulties in op- 
timizing parameters, designing (or avoiding) expensive 
inference and sampling, and helping to learn
representations that better disentangle the unknown underlying
factors of variation. There is room for exploring many
paths towards addressing all of these issues, and we have 
presented here a few appealing directions of research to- 
wards these challenges. 